Hi Ricardo,

Happy New Year!!

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment.  After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
>   (define the-process
>     (process
>      (name 'foo)
>      (outputs
>       '((result "path/to/result.bam")
>         (meta "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>   (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

Ok, in this spirit?
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies

From my point of view, there are two different paths:

 1- the inputs-outputs are attached to the process/rule/unit;
 2- the processes/rules/units are pure functions and the `workflow'
    describes how to glue them together.

If I understand correctly, Snakemake follows path 1-: the graph is
deduced from the input-output chain.  Attached is a dummy example with
Snakemake where I reuse one `shell' between two different rules.  It is
ugly because it works with strings.  And the rule `filter' cannot be
used without the rule `move_1', since the two rules are explicitly
connected by their input-output.

The other approach is to define a function that returns a process.
Then one needs to specify the graph with the `restrictions', in other
words which function composes with which one.  However, because we also
want to track the intermediate outputs, the inputs-outputs are
specified for each process; they should be optional, shouldn't they?
If I understand correctly, this is one possible approach to Dynamic
Workflows with GWL: https://www.guixwl.org/beyond-started

On one hand, with path 1-, it is hard to reuse a process/rule because
the composition is hard-coded in the inputs-outputs (the same
process/rule is duplicated with different inputs-outputs); the graph is
written by the user when they write the input-output chain.  On the
other hand, with path 2-, it is difficult to provide both the
inputs-outputs to the function and the graph without duplicating some
code.

My mind is not really clear on this, and I have no idea how to achieve
the following idea in a functional paradigm: the process/rule/unit is a
function with free inputs-outputs (arguments or variables) that returns
a process, and the workflow is a scope where these functions are
combined through some inputs-outputs.

For example, let us define two processes, move and filter:

  (define* (move in out #:optional (opt ""))
    (process
     (package-inputs `(("mv" ,mv)))
     (input in)
     (output out)
     (procedure
      `(system ,(string-append "mv " opt " " in " " out)))))

  (define (filter in out)
    (process
     (package-inputs `(("sed" ,sed)))
     (input in)
     (output out)
     (procedure
      `(system ,(string-append "sed '1d' " in " > " out)))))

Then let us create the workflow that encodes the graph:

  (define wkflow:move->filter->move
    (workflow
     (let ((tmpA (temp-file))
           (tmpB (temp-file)))
       (processes
        `((,move "my-input" ,tmpA)
          (,filter ,tmpA ,tmpB)
          (,move ,tmpB "my-output" " -v "))))))

From the `processes' list, it would be nice to deduce the graph.  I am
not sure it is possible... and the list does not even say which process
is the entry point.
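Just to make the question concrete, here is a rough Guile sketch of
what I could mean by deducing the graph.  It is not GWL code at all: it
only assumes that each entry of the `processes' list has the shape
(proc input output . extra-args), and all the names below (steps,
deduce-edges, entry-points) are made up.

  (use-modules (srfi srfi-1))   ; append-map, filter-map

  ;; Stand-in for the `processes' list above, with the procedures
  ;; replaced by plain symbols.
  (define steps
    (let ((tmpA "/tmp/A")
          (tmpB "/tmp/B"))
      `((move "my-input" ,tmpA)
        (filter ,tmpA ,tmpB)
        (move ,tmpB "my-output" " -v "))))

  (define (step-input step)  (cadr step))
  (define (step-output step) (caddr step))

  ;; One edge from the step producing a file to the step consuming it.
  (define (deduce-edges steps)
    (append-map
     (lambda (producer)
       (filter-map
        (lambda (consumer)
          (and (equal? (step-output producer) (step-input consumer))
               (cons producer consumer)))
        steps))
     steps))

  ;; Entry points: steps whose input is produced by no other step.
  (define (entry-points steps)
    (filter
     (lambda (step)
       (not (member (step-input step) (map step-output steps))))
     steps))

With the dummy `steps' above, (deduce-edges steps) returns the two
expected edges move->filter and filter->move, and (entry-points steps)
returns the first move, so maybe it is less hopeless than I feared.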
Anyway, the entry point should be fixed by the `input' and `output'
fields of `workflow'.

Since move and filter are just pure functions, one can easily reuse
them and, for example, apply them in a different order:

  (define wkflow:filter->move
    (workflow
     (let ((tmp (temp-file)))
       (processes
        `((,move ,tmp "one-output")
          (,filter "one-input" ,tmp))))))

As you said, one should also be able to write:

  (processes
   `((,move ,(output filter) "one-output")
     (,filter "one-input" ,(temp-file #:hold #t))))

Do you think it is doable?  How hard would it be?

> The question here is just how far we want to take the idea of “content
> addressed” – is it enough to take the hash of all inputs or do we need
> to compute the output hash, which could be much more expensive?

Yes, I agree.  Moreover, if the output is hashed, then its hash should
depend on the hash of the inputs and on the hash of the tools,
shouldn't it?

To me, once the workflow has run, one is happy with the results.  Then,
after a couple of months or years, one still has a copy of the working
folder but cannot find out how the results were computed: which version
of the tools was used, the binaries no longer run, etc.  Therefore, it
should be easy to extract from the results how they were computed:
versions, etc.

Last, is it useful to write the intermediate files to disk if they are
not stored?  In the thread [0], we discussed the possibility of
streaming through pipes.  Let's say, in the simple case:

  filter input > filtered
  quality filtered > output

the piped version is better if you do not care about the filtered file:

  filter input | quality > output

However, the classic pipe does not fit this case:

  filter input_R1 > R1_filtered
  filter input_R2 > R2_filtered
  align R1_filtered R2_filtered > output_aligned

In general, one is not interested in keeping the files R{1,2}_filtered,
so why spend time writing them to disk and hashing them?  In other
words, is it doable to stream the `processes' at the process level?  (A
rough sketch of what I mean is in the PS below.)  It is a different
point of view, but it reaches the same aim, I guess.

Finally, could we add a GWL session to the before-FOSDEM days?  What do
you think?

Thank you.

All the best,
simon

[0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html
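PS: To illustrate what I mean by streaming at the process level, here
is a rough and naive Guile sketch that has nothing to do with the
actual GWL API.  The helpers `fifo-file' and `run-in-background' are
made up, and `filter' and `align' are the dummy commands from the
example above; the idea is just to replace the intermediate
R{1,2}_filtered files by named pipes, so they never hit the disk.

  (use-modules (ice-9 popen))

  ;; Create a named pipe that plays the role of a temporary file.
  (define (fifo-file name)
    (let ((path (string-append "/tmp/" name)))
      (when (file-exists? path)
        (delete-file path))
      (mknod path 'fifo #o600 0)
      path))

  ;; Very naive stand-in for running one process of the workflow in
  ;; the background: start the shell command and return immediately.
  (define (run-in-background command)
    (open-input-pipe command))

  (let ((r1 (fifo-file "R1_filtered"))
        (r2 (fifo-file "R2_filtered")))
    ;; Both `filter' processes write into the FIFOs...
    (run-in-background (string-append "filter input_R1 > " r1))
    (run-in-background (string-append "filter input_R2 > " r2))
    ;; ...while `align' reads from them as if they were regular files,
    ;; so R{1,2}_filtered are streamed instead of written to disk.
    (system (string-append "align " r1 " " r2 " > output_aligned")))

The interesting question is whether the `workflow' could decide by
itself to use such a pipe when an intermediate output is not declared
as something to keep.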