From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <874lbfxijq.fsf@elephly.net> <87va3mr6fl.fsf@elephly.net>
In-Reply-To: <87va3mr6fl.fsf@elephly.net>
From: zimoun
Date: Fri, 4 Jan 2019 18:48:34 +0100
Message-ID:
Subject: Re: [GWL] (random) next steps?
Content-Type: multipart/mixed; boundary="000000000000acfbea057ea57d24"
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: guix-devel-bounces+kyle=kyleam.com@gnu.org
Sender: "Guix-devel"
To: Ricardo Wurmus
Cc: Guix Devel , gwl-devel@gnu.org
List-ID:

--000000000000acfbea057ea57d24
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Ricardo,

Happy New Year!!

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment. Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment. After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>     (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

Ok, in this spirit?
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies

As I see it, there are two different paths:
 1- the inputs-outputs are attached to the process/rule/unit
 2- the processes/rules/units are pure functions, and then the
    `workflow' describes how to glue them together.

If I understand correctly, Snakemake follows path 1-.
The graph is deduced from the inputs-outputs chain.

Attached is a dummy example with Snakemake where I reuse one `shell'
between two different rules. It is ugly because it works with strings.
And the rule `filter' cannot be used without the rule `move_1', since
the two rules are explicitly connected by their input-output.

The other approach is to define a function that returns a process.
Then one needs to specify the graph with the `restrictions', in other
words which function composes with which one. However, because we also
want to track the intermediate outputs, the inputs-outputs are specified
for each process; this should be optional, shouldn't it?
If I understand correctly, it is one possible approach to Dynamic
Workflows in GWL:
https://www.guixwl.org/beyond-started

On one hand, with path 1-, it is hard to reuse a process/rule because
the composition is hard-coded in the inputs-outputs (the same
process/rule gets duplicated with different inputs-outputs). The graph
is written by the user when they write the inputs-outputs chain.

On the other hand, with path 2-, it is difficult to provide both the
inputs-outputs to the function and the graph without duplicating some
code.

My mind is not really clear on this, and I have no idea how to achieve
the functional-paradigm idea below.

A process/rule/unit is a function with free inputs-outputs (arguments
or variables), and it returns a process.

The workflow is a scope where these functions are combined through
some inputs-outputs.

For example, let us define two processes: move and filter.

    (define* (move in out #:optional (opt ""))
      (process
        (package-inputs
         `(("mv" ,mv)))
        (input in)
        (output out)
        (procedure
         ;; Build the command from the function arguments.
         `(system ,(string-append "mv " opt " " in " " out)))))

    (define (filter in out)
      (process
        (package-inputs
         `(("sed" ,sed)))
        (input in)
        (output out)
        (procedure
         ;; Drop the first line of the input.
         `(system ,(string-append "sed '1d' " in " > " out)))))

Then let us create the workflow that encodes the graph:

    (define wkflow:move->filter->move
      (workflow
        (let ((tmpA (temp-file))
              (tmpB (temp-file)))
          (processes
           `((,move "my-input" ,tmpA)
             (,filter ,tmpA ,tmpB)
             (,move ,tmpB "my-output" " -v "))))))

It would be nice to deduce the graph from the `processes'. I am not
sure it is possible... It even lacks which one is the entry point, but
that should be fixed by the `input' and `output' fields of `workflow'.

Since move and filter are just pure functions, one can easily reuse
them and, e.g., apply them in a different order:

    (define wkflow:filter->move
      (workflow
        (let ((tmp (temp-file)))
          (processes
           `((,move ,tmp "one-output")
             (,filter "one-input" ,tmp))))))

As you said, one thing should also be possible:

    (processes
     `((,move ,(output filter) "one-output")
       (,filter "one-input" ,(temp-file #:hold #t))))

Do you think it is doable? How hard would it be?

> The question here is just how far we want to take the idea of “content
> addressed” – is it enough to take the hash of all inputs or do we need
> to compute the output hash, which could be much more expensive?

Yes, I agree.
Moreover, if the output is hashed, then the hash should depend on the
hashes of the inputs and on the hash of the tools, shouldn't it?

To me, once the workflow is computed, one is happy with the results.
Then after a couple of months or years, one still has a copy of the
working folder but is not able to find out how the results were
computed: which version of the tools was used, the binaries do not work
anymore, etc.
Therefore, it should be easy to extract from the results how they were
computed: versions, etc.
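
To make the hashing question concrete, here is a minimal Python sketch
(Python only because the attached Snakemake example is Python; nothing
here is GWL or Guix API). It assumes a hypothetical scheme in which the
key of an output is derived from the hashes of its inputs, identifiers
of the tools, and the command, without ever reading the output itself.
The names `file_hash', `derive_output_key' and the tool strings are
made up for illustration.

```python
import hashlib
import json


def file_hash(path):
    """Return the SHA-256 hex digest of a file's content."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def derive_output_key(input_hashes, tool_ids, command):
    """Derive a key for an output from what produced it (hypothetical).

    The key depends only on the input hashes (order kept, since the
    order of arguments can matter), the tool identifiers (sorted), and
    the command string. The output file is never hashed.
    """
    record = {
        "inputs": list(input_hashes),
        "tools": sorted(tool_ids),
        "command": command,
    }
    # Serialize deterministically so that equal records give equal keys.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), record


if __name__ == "__main__":
    key, record = derive_output_key(
        input_hashes=["<hash of my-input>"],   # placeholder value
        tool_ids=["sed@<version>"],            # placeholder value
        command="sed '1d'",
    )
    print(key)
    print(json.dumps(record, indent=2))
```

The returned record could be stored next to the result, so that months
later one can still read which inputs, tools and command produced it.
This covers only the cheap variant (hash of what goes in); it says
nothing about whether the output is bit-for-bit reproducible.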

Last, is it useful to write the intermediate files to disk if they are
not stored?
In the thread [0], we discussed the possibility of streaming the pipes.
Take the simple case:

    filter input > filtered
    quality filtered > output

The piped version is better if you do not care about the filtered file:

    filter input | quality > output

However, the classic pipe does not fit this case:

    filter input_R1 > R1_filtered
    filter input_R2 > R2_filtered
    align R1_filtered R2_filtered > output_aligned

In general, one is not interested in keeping the files R{1,2}_filtered.
So why spend time writing them to disk and hashing them?

In other words, is it doable to stream the `processes' at the process
level?

It is a different point of view, but it reaches the same aim, I guess.

Last, could we add a GWL session to the before-FOSDEM days?

What do you think?

Thank you.

All the best,
simon

[0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

--000000000000acfbea057ea57d24
Content-Type: application/octet-stream; name="func.smk"
Content-Disposition: attachment; filename="func.smk"
Content-Transfer-Encoding: base64
Content-ID:
X-Attachment-Id: f_jqi5yl2g0

CgoKcnVsZSBvdXQ6CiAgICBpbnB1dDoKICAgICAgICAib3V0cHV0LnR4dCIKCgojIyMKIwojIEdl
bmVyYXRlIGZha2UgZGF0YQojICh0b3AgaGVyZSBiZWNhdXNlIFNuYWtlIGlzIG5vdCBkZWNsYXJh
dGl2ZSBsYW5ndWFnZSkKIwpydWxlIGZha2VfZGF0YToKICAgIG91dHB1dDoKICAgICAgICAiaW5p
dC50eHQiCiAgICBzaGVsbDoKICAgICAgICIiIgogICAgICAgZWNobyAtZSAnZmlyc3QgbGluZVxu
b2shJyA+IHtvdXRwdXR9CiAgICAgICAiIiIKIwojIwoKCiMjIwojCiMgRXhhbXBsZSBvZiByZS11
c2FibGUgcHJvY2Vzc2luZyBiZXR3ZWVuIHJ1bGVzCiMKZGVmIG1vdmUoaW5wdXRzLCBvdXRwdXRz
LCBwYXJhbXM9Tm9uZSk6CiAgICAiIiIKICAgIE1vdmUgaW5wdXRzIHRvIG91cHV0cy4KCiAgICBO
b3RlOiBwYXJhbXMgaXMgb3B0aW9uYWwgYW5kIHByb3ZpZGVzIG9wdGlvbnMgb2YgdGhlIG12IGNv
bW1hbmQuCiAgICAiIiIKICAgIHRyeToKICAgICAgICBvcHRpb25zID0gcGFyYW1zLm9wdGlvbnMK
ICAgIGV4Y2VwdDoKICAgICAgICBvcHRpb25zID0gJycKICAgICAgICBwYXNzCiAgICBjbWQgPSAi
IiIKCiAgICBtdiB7b3B0aW9uc30ge2lucHV0c30ge291dHB1dHN9CgogICAgIiIiLmZvcm1hdChp bnB1dHM9aW5wdXRzLAogICAgICAgICAgICAgICBvdXRwdXRzPW91dHB1dHMsCiAgICAgICAgICAg ICAgIG9wdGlvbnM9b3B0aW9ucykKICAgIHJldHVybiBjbWQKIwojIyMKCgojIyMKIwojIEJlY2F1 c2UgUHl0aG9uIHJvY2tzISA7LSkKIwpkZWYgZ2VuZXJhdG9yKCk6CiAgICAiIiIKICAgIFNpbXBs ZSBnZW5lcmF0b3Igb2YgdGVtcG9yYXJ5IG5hbWUuCgogICAgRXhhbXBsZSBvZiB1c2U6CgogICAg ID4gbmFtZSA9IGdlbmVyYXRvcigpCiAgICAgPiBuZXh0KG5hbWUpCiAgICAgIDAKICAgICA+IG5l eHQobmFtZSkKICAgICAgMQogICAgZXRjLgogICAgIiIiCiAgICBpID0gMAogICAgd2hpbGUgVHJ1 ZToKICAgICAgICB5aWVsZCBzdHIoaSkgKyAnLnRtcCcKICAgICAgICBpICs9IDEKbmFtZSA9IGdl bmVyYXRvcigpCiMKIyMjCgoKCiMjIwojCiMgVGhlIGludGVybmFsIHJ1bGVzCiMKIyMjCgpydWxl IG1vdmVfMToKICAgIGlucHV0OgogICAgICAgIHtydWxlcy5mYWtlX2RhdGEub3V0cHV0fQogICAg b3V0cHV0OgogICAgICAgIHRlbXAobmV4dChuYW1lKSkKICAgIHBhcmFtczoKICAgICAgICBvcHRp b25zID0gJy12JywKICAgIHJ1bjoKICAgICAgICBzaGVsbChtb3ZlKGlucHV0LCBvdXRwdXQsIHBh cmFtcykpCgoKcnVsZSBmaWx0ZXI6CiAgICBpbnB1dDoKICAgICAgICB7cnVsZXMubW92ZV8xLm91 dHB1dH0KICAgIG91dHB1dDoKICAgICAgICB0ZW1wKG5leHQobmFtZSkpCiAgICBzaGVsbDoKICAg ICAgICAiIiIKCiAgICAgICAgc2VkICcxZCcge2lucHV0fSA+IHtvdXRwdXR9CgogICAgICAgICIi IgoKcnVsZSBtb3ZlXzI6CiAgICBpbnB1dDoKICAgICAgICB7cnVsZXMuZmlsdGVyLm91dHB1dH0K ICAgIG91dHB1dDoKICAgICAgICB7cnVsZXMub3V0LmlucHV0fQogICAgcnVuOgogICAgICAgIHNo ZWxsKG1vdmUoaW5wdXQsIG91dHB1dCkpCg== --000000000000acfbea057ea57d24--