* Next steps for the GWL @ 2019-05-29 13:47 Ricardo Wurmus 2019-06-03 15:16 ` zimoun ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Ricardo Wurmus @ 2019-05-29 13:47 UTC (permalink / raw) To: gwl-devel Hi, I’m going to use the GWL in the next few days to rewrite the PiGx RNAseq pipeline from Snakemake. This will likely show me what features are still missing from the GWL and what implemented features are awkward to use. My goals for the future (in no particular order) are as follows: * add support for running workflows from a file (without GUIX_WORKFLOW_PATH) * replace the “qsub”-based Grid Engine execution engine with proper DRMAA bindings * add support for running remote jobs via Guile SSH * add support for spawning jobs on AWS or Azure or via Kubernetes * tighter integration with Guix features, e.g. to export a container image per process via “guix pack” or to pack up the whole workflow as a relocatable executable. * inversion of control: enable workflow designers to use the GWL as a library, so that the “guix workflow” user interface does not need to be used at all (see PiGx for an example). * explore the use of inferiors — the GWL should be usable with any version of Guix that may be installed, not just the version that was used at compilation time. Can we use “guix repl” and inferiors, perhaps? * add support for executing processes in isolated environments (containers) — this requires a better understanding of process inputs. What do you think about this? Who would like to help flesh out these ideas? -- Ricardo ^ permalink raw reply [flat|nested] 16+ messages in thread
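[Editorial note: the “inferiors” bullet above can be sketched with Guix’s existing (guix inferior) API. This is a minimal illustration only, not GWL code; the commit hash is a placeholder and the idea that the GWL would resolve packages through such an inferior is an assumption.]

```scheme
;; Sketch only: how the GWL might resolve packages through a pinned
;; Guix revision instead of the Guix it was compiled against.
;; The commit below is a placeholder, not a real revision.
(use-modules (guix channels)
             (guix inferior)
             (srfi srfi-1))

(define pinned-channels
  (list (channel
         (name 'guix)
         (url "https://git.savannah.gnu.org/git/guix.git")
         (commit "0000000000000000000000000000000000000000")))) ; placeholder

;; Spawns (or reuses) an inferior Guix for these channels.
(define inferior
  (inferior-for-channels pinned-channels))

;; Look up a package object in the pinned Guix; a workflow process
;; could then use this instead of a package from the host Guix.
(define samtools
  (first (lookup-inferior-packages inferior "samtools")))
```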
* Re: Next steps for the GWL
  2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
@ 2019-06-03 15:16 ` zimoun
  2019-06-03 16:18 ` Ricardo Wurmus
  2019-06-06  3:19 ` Kyle Meyer
  2019-06-12  9:46 ` Ricardo Wurmus
  2 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2019-06-03 15:16 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: gwl-devel

Hi Ricardo,

On Wed, 29 May 2019 at 16:41, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:
>
> I’m going to use the GWL in the next few days to rewrite the PiGx RNAseq
> pipeline from Snakemake.  This will likely show me what features are
> still missing from the GWL and what implemented features are awkward to
> use.

Awesome!

> * tighter integration with Guix features, e.g. to export a container
>   image per process via “guix pack” or to pack up the whole workflow as
>   a relocatable executable.

Yes!  Awesome.  Relocatable tarballs.  Docker images.  Singularity ones.

And maybe generate one pack (docker) per process and something to glue
them together, e.g., http://www.genouest.org/godocker/

> * explore the use of inferiors — the GWL should be usable with any
>   version of Guix that may be installed, not just the version that was
>   used at compilation time.  Can we use “guix repl” and inferiors,
>   perhaps?

For reproducibility, a Guix commit should be provided and a `guix pull`
(inferiors) should be used.  For example, the output of `guix describe
-f channels` should be used, either with an option or directly in the
Scheme/Wisp workflow file with a new keyword.

> * add support for executing processes in isolated environments
>   (containers) — this requires a better understanding of process inputs.

Maybe this is the same story as the GoDocker above.

Talking about ideas:
 - what about the Content Addressable Store?
 - what about a bridge with CWL?

Thank you for reviving the list. :-)

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
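[Editorial note: the pinning zimoun describes starts from the output of `guix describe -f channels`, which is itself a Scheme channels specification. The fragment below shows that format with a placeholder commit; the idea of a GWL keyword or option consuming such a file is an assumption.]

```scheme
;; channels.scm, as emitted by “guix describe -f channels”
;; (the commit hash here is a placeholder).  Saving this file pins the
;; exact Guix revision; a future GWL option or workflow keyword could
;; point at such a file to reproduce the environment.
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit "0000000000000000000000000000000000000000")))
```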
* Re: Next steps for the GWL
  2019-06-03 15:16 ` zimoun
@ 2019-06-03 16:18 ` Ricardo Wurmus
  2019-06-06 11:07   ` zimoun
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-03 16:18 UTC (permalink / raw)
  To: zimoun; +Cc: gwl-devel

Hi simon,

>> * tighter integration with Guix features, e.g. to export a container
>>   image per process via “guix pack” or to pack up the whole workflow as
>>   a relocatable executable.
>
> Yes!  Awesome.
> Relocatable tarballs.  Docker images.  Singularity ones.
>
> And maybe generate one pack (docker) per process and something to glue
> them together, e.g.,
> http://www.genouest.org/godocker/

Generating one “container image” per process is a desirable goal (even
though it seems a little wasteful).  I don’t know how godocker fits into
this.  The home page says:

    It is a batch computing/cluster management tool using Docker as
    execution/isolation system.  It can be seen like Sun Grid
    Engine/Torque/...  The software does not manage however itself the
    dispatch of the commands on the remote nodes.  For this, it
    integrates with container management tools (Docker Swarm, Apache
    Mesos, ...)  It acts as an additional layer above those tools on
    multiple user systems where users do not have Docker privileges or
    knowledge.

Can we directly support these container management tools?  I’d like to
make GWL workflows very portable, so that there are only few runtime
requirements.  Depending on a cluster management tool to be configured
would be counter to this goal.

> Talking about ideas:
> - what about the Content Addressable Store?

This already exists, but I’m not sure it’s sufficient.

> - what about a bridge with CWL?

I’m open to this idea, but it would need to be well-defined.  What does
it really mean?  Generating CWL files from GWL workflows?  That really
shouldn’t be too hard.  Anything else, however, is hard for me to
imagine.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-03 16:18 ` Ricardo Wurmus
@ 2019-06-06 11:07   ` zimoun
  2019-06-06 12:19     ` Ricardo Wurmus
  0 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2019-06-06 11:07 UTC (permalink / raw)
  To: Ricardo Wurmus, Pjotr Prins; +Cc: gwl-devel

Hi,

(+ Pjotr because I am sure he has an interesting opinion but I am not
sure he closely reads this list ;-)

On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:

>> - what about a bridge with CWL?
>
> I’m open to this idea, but it would need to be well-defined.  What does
> it really mean?  Generating CWL files from GWL workflows?  That really
> shouldn’t be too hard.  Anything else, however, is hard for me to
> imagine.

Well, let me point out previous threads about this topic:

https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html
https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html

1- Generating CWL from GWL would be nice.  It should ease the use of
platforms and tools already in place (AWS, etc.)

2- Use CWL as a process.  A lot of work has been done by Pjotr and
reported here [1]

[1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 11:07 ` zimoun
@ 2019-06-06 12:19 ` Ricardo Wurmus
  2019-06-06 13:23   ` Pjotr Prins
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 12:19 UTC (permalink / raw)
  To: zimoun; +Cc: Pjotr Prins, gwl-devel

Hi simon,

> (+ Pjotr because I am sure he has an interesting opinion but not sure
> he closely reads this list ;-)
>
> On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus
> <ricardo.wurmus@mdc-berlin.de> wrote:
>
>> > - what about a bridge with CWL?
>>
>> I’m open to this idea, but it would need to be well-defined.  What does
>> it really mean?  Generating CWL files from GWL workflows?  That really
>> shouldn’t be too hard.  Anything else, however, is hard for me to
>> imagine.
>
> Well, I point out previous threads about this topic:
>
> https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html
> https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html
>
> 1-
> Generating CWL from GWL should be nice.  It should ease the use of
> already in-place platform and tools (AWS, etc.)

Generating CWL from GWL should be easy, but it’s also not all that
useful.  The GWL takes care of software deployment, so not only should
we generate CWL files but also generate (and upload?) Docker images and
make the CWL file reference them.

The tooling for CWL… seems a little less substantial and focused than it
first appears.  The cwltool can only run CWL workflows locally — no
DRMAA, no AWS.  All the other runners that are listed on the CWL website
are either very limited or very large environments where CWL execution
is not necessarily the primary purpose (cf Galaxy or Arvados).

Still, I think it’s the most meaningful connection the GWL can have with
the CWL: using the GWL as a high-level representation which “compiles”
down to a lower-level representation of CWL + Docker images when needed.

> 2-
> Use CWL as a process.  A lot of work has been done by Pjotr and
> reported here [1]
>
> [1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/

Yes, this works, of course, but that’s a level of integration that’s
extremely limited, in my opinion.  Using Guix with the CWL is fine as
the blog post demonstrates, but there is very little to be gained and
much to be lost when embedding CWL in a GWL workflow.  The only thing
this enables is reusing existing CWL workflows as a GWL “process”.
There is no meaningful integration – the embedded CWL workflow is a
second-class citizen that cannot benefit from any of the GWL features.

If the CWL workflow is connected to the GWL via cwltool then the only
way to run the workflow on a DRMAA-supported cluster, a bunch of
SSH-connected servers, or AWS EC2 instances is to wrap it up in a GWL
context.  The GWL treats the process as its smallest unit of
organisation, so a CWL workflow that’s run as a GWL process cannot
really be scaled.  If the user has a different CWL execution environment
(such as an Arvados installation), the CWL workflow embedded in the GWL
will not be able to make use of it.  It would forever be tied to the
particular version of cwltool in Guix.

I’d rather not advocate this use of the CWL in the GWL.  It might sound
good (“The GWL is compatible with the CWL!”), but ultimately it’s a
really awkward connection that is bound to lead to a great deal of
frustration.

Does this make sense?  I don’t want to be dismissive.  It would be great
if we could come up with something that’s mutually beneficial for CWL
users and GWL users alike, but I feel that our options are very limited.
I’m still open to ideas and use-case scenarios.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
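[Editorial note: the “compile down to CWL + Docker images” idea above might produce output roughly like the CWL fragment below. This is a hypothetical sketch: the tool, image tag, and file names are invented for illustration, and only the CWL v1.0 syntax itself (CommandLineTool, DockerRequirement) is real.]

```yaml
# Hypothetical result of exporting one GWL process as a CWL
# CommandLineTool, referencing an image built with “guix pack -f docker”.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, index]
requirements:
  DockerRequirement:
    dockerPull: example.org/gwl/samtools-process:1.9   # hypothetical image
inputs:
  bam:
    type: File
    inputBinding: {position: 1}
outputs:
  index:
    type: File
    outputBinding: {glob: "*.bai"}
```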
* Re: Next steps for the GWL 2019-06-06 12:19 ` Ricardo Wurmus @ 2019-06-06 13:23 ` Pjotr Prins 0 siblings, 0 replies; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 13:23 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel On Thu, Jun 06, 2019 at 02:19:04PM +0200, Ricardo Wurmus wrote: > > Hi simon, > > > (+ Pjotr because I am sure he has an interesting opinion but not sure > > he closely reads this list ;-) I read it :) > > On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus > > <ricardo.wurmus@mdc-berlin.de> wrote: > > > >> > - what about a bridge with CWL? > >> > >> I’m open to this idea, but it would need to be well-defined. What does > >> it really mean? Generating CWL files from GWL workflows? That really > >> shouldn’t be too hard. Anything else, however, is hard for me to > >> imagine. > > > > Well, I point out previous threads about this topic: > > > > https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html > > https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html > > > > 1- > > Generating CWL from GWL should be nice. It should ease the use of > > already in-place platform and tools (AWS, etc.) > > Generating CWL from GWL should be easy, but it’s also not all that > useful. The GWL takes care of software deployment, so not only should > we generate CWL files but also generate (and upload?) Docker images and > make the CWL file reference them. > > The tooling for CWL… seems a little less substantial and focused than it > first appears. The cwltool can only run CWL workflows locally — no > DRMAA, no AWS. All the other runners that are listed on the CWL website > are either very limited or very large environments where CWL execution > is not necessarily the primary purpose (cf Galaxy or Arvados). > > Still, I think it’s the most meanigful connection the GWL can have with > the CWL: using the GWL as a high-level representation which “compiles” > down to a lower-level representation of CWL + Docker images when needed. 
> > > 2- > > Use CWL as a process. A lot of work have been done by Pjotr and > > reported here [1] > > > > > > [1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/ > > Yes, this works, of course, but that’s a level of integration that’s > extremely limited, in my opinion. Using Guix with the CWL is fine as > the blog post demonstrates, but there is very little to be gained and > much to be lost when embedding CWL in a GWL workflow. The only thing > this enables is reusing existing CWL workflows as a GWL “process”. > There is no meaningful integration – the embedded CWL workflow is a > second-class citizen that cannot benefit from any of the GWL features. > > If the CWL workflow is connected to the GWL via cwltool then the only > way to run the workflow on a DRMAA-supported cluster or a bunch of > SSH-connected servers, or AWS EC2 instances is to wrap it up in a GWL > context. The GWL treats the process as its smallest unit of > organisation, so a CWL workflow that’s run as a GWL process cannot > really be scaled. If the user has a different CWL execution environment > (such as an Arvados installation), the CWL workflow embedded in the GWL > will not be able to make use of it. It would forever be tied to the > particular version of cwltool in Guix. > > I’d rather not advocate this use of the CWL in the GWL. It might sound > good (“The GWL is compatible with the CWL!”), but ultimately it’s a > really awkward connection that is bound to lead to a great deal of > frustration. > > Does this make sense? Yes. Personally I also think the CWL is flawed. It overcomplicates things and the reference implementation is pretty crappy. If we get GWL to work in my environment I would think it a breath of fresh air. Not to say that the CWL does not have some bad ideas (triple negative). You can read my blog for that. > I don’t want to be dismissive. 
It would be great if we could come
> up with something that’s mutually beneficial for CWL users and GWL
> users alike, but I feel that our options are very limited.  I’m still
> open to ideas and use-case scenarios.

We can probably just mix the two.  I mean the main benefit of the CWL is
*sharing* workflows that have been described by others.  That is the
point of the CWL and even at that it has not proven really great (after
all this time, how much is shared?).  Since CWL and GWL can use the same
file system and job submission system I think it is pretty OK for GWL to
ignore the CWL and either send data from one to the other or execute CWL
pipelines from GWL.  Both are possible without much work.

Pj.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus 2019-06-03 15:16 ` zimoun @ 2019-06-06 3:19 ` Kyle Meyer 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-12 9:46 ` Ricardo Wurmus 2 siblings, 1 reply; 16+ messages in thread From: Kyle Meyer @ 2019-06-06 3:19 UTC (permalink / raw) To: Ricardo Wurmus, gwl-devel Hi Ricardo, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes: > My goals for the future (in no particular order) are as follows: Thanks for sharing. Looks very exciting! > [...] > > * inversion of control: enable workflow designers to use the GWL as a > library, so that the “guix workflow” user interface does not need to > be used at all (see PiGx for an example). Sounds like a good direction to go, and I imagine it'd facilitate building wrappers that extend GWL. One of the things I'd love to do with GWL is to make it play well with git-annex, something that would almost certainly be too specific for GWL itself. For example * Make data caching git-annex aware. When deciding to recompute data files, GWL avoids computing the hash of data files, using scripts as the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. But if the user is tracking data files with git-annex, getting the hash of data files becomes less expensive because we can ask git-annex for the hash it has already computed. * Support getting annex data files on demand (i.e. 'git annex get') if they are needed as inputs. > * explore the use of inferiors — the GWL should be usable with any > version of Guix that may be installed, not just the version that was > used at compilation time. Can we use “guix repl” and inferiors, > perhaps? For my personal use, I'd almost always want to pin an analysis workflow at a certain Guix version, so making it easy to use inferiors in the workflow would be great. > * add support for executing processes in isolated environments > (containers) — this requires a better understanding of process inputs. 
This is another one I'm especially excited about. Functionality-wise, are you imagining essentially matching the options available for 'guix environment --container ...'? -- Kyle ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 3:19 ` Kyle Meyer @ 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-06 10:55 ` zimoun 2019-06-06 15:07 ` Kyle Meyer 0 siblings, 2 replies; 16+ messages in thread From: Ricardo Wurmus @ 2019-06-06 10:11 UTC (permalink / raw) To: Kyle Meyer; +Cc: gwl-devel Hi Kyle, thanks for your comments! > One of the things I'd love to do > with GWL is to make it play well with git-annex, something that would > almost certainly be too specific for GWL itself. For example > > * Make data caching git-annex aware. When deciding to recompute data > files, GWL avoids computing the hash of data files, using scripts as > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > But if the user is tracking data files with git-annex, getting the > hash of data files becomes less expensive because we can ask > git-annex for the hash it has already computed. > > * Support getting annex data files on demand (i.e. 'git annex get') if > they are needed as inputs. I wonder what the protocol should look like. Should a workflow explicitly request a “git annex” file or should it be up to the person running the workflow, i.e. when “git annex” has been configured to be the cache backend it would simply look up the declared input/output files there. I suppose the answers would equally apply to using IPFS as a cache. >> * add support for executing processes in isolated environments >> (containers) — this requires a better understanding of process inputs. > > This is another one I'm especially excited about. Functionality-wise, > are you imagining essentially matching the options available for 'guix > environment --container ...'? 
So far this is all I’ve got: --8<---------------cut here---------------start------------->8--- diff --git a/gwl/processes.scm b/gwl/processes.scm index beb61cc..264807f 100644 --- a/gwl/processes.scm +++ b/gwl/processes.scm @@ -19,13 +19,19 @@ #:use-module ((guix derivations) #:select (derivation->output-path build-derivations)) + #:use-module ((guix packages) + #:select (package-file)) #:use-module (guix gexp) - #:use-module ((guix monads) #:select (mlet return)) + #:use-module ((guix monads) #:select (mlet mapm return)) #:use-module (guix records) #:use-module ((guix store) #:select (run-with-store with-store %store-monad)) + #:use-module ((guix modules) + #:select (source-module-closure)) + #:use-module (gnu system file-systems) + #:use-module (gnu build linux-container) #:use-module (ice-9 format) #:use-module (ice-9 match) #:use-module (srfi srfi-1) @@ -276,6 +282,54 @@ plain S-expression." (call process code))) (whatever (error (format #f "unsupported procedure: ~a\n" whatever))))) +;; WIP +(define (containerize exp process) + "Wrap EXP, an S-expression or G-expression, in a G-expression that +causes EXP to be run in a container according to the requirements +specified in PROCESS." + (let* ((package-dirs + (with-store store + (run-with-store store + (mapm %store-monad package-file + (process-package-inputs process))))) + (data-inputs + (process-data-inputs process)) + (output-dirs + (delete-duplicates + (map dirname (process-outputs process)))) + (input-mappings + (map (lambda (location) + (file-system-mapping + (source location) + (target location) + (writable? #f))) + (lset-difference string=? + (append package-dirs + data-inputs) + output-dirs))) + (output-mappings + (map (lambda (dir) + (file-system-mapping + (source dir) + (target dir) + (writable? 
#t))) + output-dirs)) + (specs + (map (compose file-system->spec + file-system-mapping->bind-mount) + (append input-mappings + output-mappings)))) + (with-imported-modules (source-module-closure + '((gnu build linux-container) + (gnu system file-systems))) + #~(begin + (use-modules (gnu build linux-container) + (gnu system file-systems)) + (call-with-container (append %container-file-systems + (map spec->file-system + '#$specs)) + (lambda () #$exp)))))) + ;;; --------------------------------------------------------------------------- ;;; ADDITIONAL FUNCTIONS ;;; --------------------------------------------------------------------------- --8<---------------cut here---------------end--------------->8--- This means that it can map file systems into the container and then run the process expression in that environment. One thing I’m not happy about is that I can only mount directories, and not individual files that have been declared as inputs. I’d like to have more fine-grained access. I suppose it might be possible to mount just the relevant parts of the GWL cache, but I need to play with this to better understand what the desired behaviour would be. -- Ricardo ^ permalink raw reply related [flat|nested] 16+ messages in thread
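[Editorial note: on the “I can only mount directories” point above — Guix’s file-system mappings can in principle target single files as well, which is how `guix environment --expose` handles file arguments. A minimal sketch, with a hypothetical input path:]

```scheme
;; Sketch: mapping one declared input *file* (rather than its whole
;; directory) read-only into the container.  The path is hypothetical.
(use-modules (gnu system file-systems))

(define input-mapping
  (file-system-mapping
   (source "/data/sample.fastq")   ; hypothetical declared input
   (target "/data/sample.fastq")
   (writable? #f)))
```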
* Re: Next steps for the GWL
  2019-06-06 10:11 ` Ricardo Wurmus
@ 2019-06-06 10:55   ` zimoun
  2019-06-06 11:59     ` Ricardo Wurmus
  2019-06-06 13:44     ` Pjotr Prins
  1 sibling, 2 replies; 16+ messages in thread
From: zimoun @ 2019-06-06 10:55 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: Kyle Meyer, gwl-devel

Hi,

On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:

> > One of the things I'd love to do
> > with GWL is to make it play well with git-annex, something that would
> > almost certainly be too specific for GWL itself.  For example
> >
> > * Make data caching git-annex aware.  When deciding to recompute data
> >   files, GWL avoids computing the hash of data files, using scripts as
> >   the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net.
> >   But if the user is tracking data files with git-annex, getting the
> >   hash of data files becomes less expensive because we can ask
> >   git-annex for the hash it has already computed.
> >
> > * Support getting annex data files on demand (i.e. 'git annex get') if
> >   they are needed as inputs.
>
> I wonder what the protocol should look like.  Should a workflow
> explicitly request a “git annex” file or should it be up to the person
> running the workflow, i.e. when “git annex” has been configured to be
> the cache backend it would simply look up the declared input/output
> files there.
>
> I suppose the answers would equally apply to using IPFS as a cache.

I agree that a mechanism such as `git-annex` would be nice.
But is it not a means for the CAS that we previously discussed?

I fully agree with the features and their description.  Totally cool!
However, I am a bit reluctant about `git-annex` because it requires a
Haskell compiler and it is far far from "bootstrappability".  I am aware
of Ricardo's attempt---and AFAIK the only one.  And here [1] are
explanations by one Haskeller.

My opinion: GWL should stay on the path of Reproducibility, end-to-end.
So `git-annex` should be a transitional step---while the Haskell
bootstrap is not solved---as a means for the CAS (cache), and I would
find it more elegant to use the "data-oriented IPFS": IPLD [2].

[1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC
[2] https://ipld.io/

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 10:55   ` zimoun
@ 2019-06-06 11:59     ` Ricardo Wurmus
  0 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 11:59 UTC (permalink / raw)
  To: zimoun; +Cc: Kyle Meyer, gwl-devel

Hi simon,

> I agree that a mechanism such as `git-annex` would be nice.
> But is it not a means for the CAS that we previously discussed?

It does not need to be the *only* mechanism.  Multiple backends can
serve different users.

> I fully agree with the features and their description.  Totally cool!
> However, I am a bit reluctant about `git-annex` because it requires a
> Haskell compiler and it is far far from "bootstrappability".  I am aware
> of Ricardo's attempt---and AFAIK the only one.  And here [1] are
> explanations by one Haskeller.

This is off-topic, but I’m probably going to bite the bullet and simply
use GCC 2.x to build an old GHC 4.x from the C “source” files, which are
surprisingly close to actual source code.  I’ve tried to build GHC 4.x
with a recent compiler, but the code depends on too many quirks of GCC 2
that make it very hard to be sure about the behavior post migration.

(Re [1]: I talked to Joachim at one of the repro builds summits about
the GHC bootstrapping attempts, which prompted their blog post.)

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 10:55 ` zimoun 2019-06-06 11:59 ` Ricardo Wurmus @ 2019-06-06 13:44 ` Pjotr Prins 2019-06-06 14:06 ` Pjotr Prins 1 sibling, 1 reply; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 13:44 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel, Ricardo Wurmus IPFS is meant for data sharing and reproducibility. It also allows for private networks which is rather important. Scalability of IPFS is a concern, so either we cache using IPFS or we have some other caching mechanism. git-annex is too much of a hack in my book. It also does not scale that well. Pj. On Thu, Jun 06, 2019 at 12:55:52PM +0200, zimoun wrote: > Hi, > > On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus > <ricardo.wurmus@mdc-berlin.de> wrote: > > > > One of the things I'd love to do > > > with GWL is to make it play well with git-annex, something that would > > > almost certainly be too specific for GWL itself. For example > > > > > > * Make data caching git-annex aware. When deciding to recompute data > > > files, GWL avoids computing the hash of data files, using scripts as > > > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > > > But if the user is tracking data files with git-annex, getting the > > > hash of data files becomes less expensive because we can ask > > > git-annex for the hash it has already computed. > > > > > > * Support getting annex data files on demand (i.e. 'git annex get') if > > > they are needed as inputs. > > > > I wonder what the protocol should look like. Should a workflow > > explicitly request a “git annex” file or should it be up to the person > > running the workflow, i.e. when “git annex” has been configured to be > > the cache backend it would simply look up the declared input/output > > files there. > > > > I suppose the answers would equally apply to using IPFS as a cache. > > I agree that the mechanism such as `git-annex` should be nice. > But is it not a mean for the CAS that we previously discussed? 
> > I fully agree with the features and their description. Totally cool! > However, I am a bit reluctant with `git-annex` because it requires a > Haskell compiler and it is far far from "bootstrapability". I am aware > of the Ricardo's try---and AFIAK the only one. And here [1] > explanations by one Haskeller. > > My opinion: GWL should stay on the path of Reproducibility, > end-to-end. So `git-annex` should be a transitional step---while the > Haskell bootstrap is not solved---as a mean for the CAS (cache) and I > would find more elegant to use the "data-oriented IPFS": IPLD [2]. > > > [1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC > [2] https://ipld.io/ > > > All the best, > simon > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 13:44 ` Pjotr Prins @ 2019-06-06 14:06 ` Pjotr Prins 0 siblings, 0 replies; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 14:06 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel, Ricardo Wurmus We should also assess this https://labs.eleks.com/2019/03/ipfs-network-data-replication.html On Thu, Jun 06, 2019 at 08:44:04AM -0500, Pjotr Prins wrote: > IPFS is meant for data sharing and reproducibility. It also allows for > private networks which is rather important. > > Scalability of IPFS is a concern, so either we cache using IPFS or we > have some other caching mechanism. > > git-annex is too much of a hack in my book. It also does not scale > that well. > > Pj. > > On Thu, Jun 06, 2019 at 12:55:52PM +0200, zimoun wrote: > > Hi, > > > > On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus > > <ricardo.wurmus@mdc-berlin.de> wrote: > > > > > > One of the things I'd love to do > > > > with GWL is to make it play well with git-annex, something that would > > > > almost certainly be too specific for GWL itself. For example > > > > > > > > * Make data caching git-annex aware. When deciding to recompute data > > > > files, GWL avoids computing the hash of data files, using scripts as > > > > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > > > > But if the user is tracking data files with git-annex, getting the > > > > hash of data files becomes less expensive because we can ask > > > > git-annex for the hash it has already computed. > > > > > > > > * Support getting annex data files on demand (i.e. 'git annex get') if > > > > they are needed as inputs. > > > > > > I wonder what the protocol should look like. Should a workflow > > > explicitly request a “git annex” file or should it be up to the person > > > running the workflow, i.e. when “git annex” has been configured to be > > > the cache backend it would simply look up the declared input/output > > > files there. 
> > > > > > I suppose the answers would equally apply to using IPFS as a cache. > > > > I agree that the mechanism such as `git-annex` should be nice. > > But is it not a mean for the CAS that we previously discussed? > > > > I fully agree with the features and their description. Totally cool! > > However, I am a bit reluctant with `git-annex` because it requires a > > Haskell compiler and it is far far from "bootstrapability". I am aware > > of the Ricardo's try---and AFIAK the only one. And here [1] > > explanations by one Haskeller. > > > > My opinion: GWL should stay on the path of Reproducibility, > > end-to-end. So `git-annex` should be a transitional step---while the > > Haskell bootstrap is not solved---as a mean for the CAS (cache) and I > > would find more elegant to use the "data-oriented IPFS": IPLD [2]. > > > > > > [1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC > > [2] https://ipld.io/ > > > > > > All the best, > > simon > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-06 10:55 ` zimoun @ 2019-06-06 15:07 ` Kyle Meyer 2019-06-06 20:29 ` Ricardo Wurmus 1 sibling, 1 reply; 16+ messages in thread From: Kyle Meyer @ 2019-06-06 15:07 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes: >> One of the things I'd love to do >> with GWL is to make it play well with git-annex, something that would >> almost certainly be too specific for GWL itself. [...] > I wonder what the protocol should look like. Should a workflow > explicitly request a “git annex” file or should it be up to the person > running the workflow, i.e. when “git annex” has been configured to be > the cache backend it would simply look up the declared input/output > files there. The latter is what I had in mind. One benefit I see of leaving it up to the configured backend is that it makes it easier to share a workflow with someone that doesn't have/want the requirements for a particular backend. >>> * add support for executing processes in isolated environments >>> (containers) — this requires a better understanding of process inputs. [...] > This means that it can map file systems into the container and then run > the process expression in that environment. > > One thing I’m not happy about is that I can only mount directories, and > not individual files that have been declared as inputs. I’d like to > have more fine-grained access. Right, limiting to the declared files makes sense. With `docker run', you can give files to -v: % ls /tmp/ | wc -l 121 % file /tmp/scratch /tmp/scratch: ASCII text % docker run -it --rm -v /tmp/scratch:/tmp/scratch busybox ls /tmp scratch It looks like using files works for `guix environment' too, which makes me think that call-with-container can handle receiving files in MOUNT. 
% guix environment -C --ad-hoc coreutils -- ls /tmp | wc -l
0
% guix environment -C --expose=/tmp/scratch=/tmp/scratch --ad-hoc coreutils -- ls /tmp
scratch

^ permalink raw reply	[flat|nested] 16+ messages in thread
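Taken a step further, the per-file exposure shown in the transcript could be assembled mechanically from a process's declared inputs. A minimal shell sketch of that idea (the file names are invented for illustration, and the command is printed rather than executed so it can be inspected):

```shell
# Build one --expose flag per declared input file, so a container
# created with `guix environment -C` sees only what the process
# declared.  File names are made up for illustration.

declared_inputs="/tmp/scratch /tmp/genome.fa"

expose_flags=""
for f in $declared_inputs; do
    expose_flags="$expose_flags --expose=$f=$f"
done

# Print the resulting command line instead of exec'ing it.
echo guix environment -C$expose_flags --ad-hoc coreutils -- ls /tmp
```

A real driver would `exec` the constructed command instead of echoing it.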
* Re: Next steps for the GWL
  2019-06-06 15:07           ` Kyle Meyer
@ 2019-06-06 20:29           ` Ricardo Wurmus
  2019-06-07  4:11             ` Kyle Meyer
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 20:29 UTC (permalink / raw)
To: Kyle Meyer; +Cc: gwl-devel

Kyle Meyer <kyle@kyleam.com> writes:

> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:
>
>>> One of the things I'd love to do with GWL is to make it play well
>>> with git-annex, something that would almost certainly be too specific
>>> for GWL itself.
>
> [...]
>
>> I wonder what the protocol should look like.  Should a workflow
>> explicitly request a “git annex” file or should it be up to the person
>> running the workflow, i.e. when “git annex” has been configured to be
>> the cache backend it would simply look up the declared input/output
>> files there.
>
> The latter is what I had in mind.  One benefit I see of leaving it up to
> the configured backend is that it makes it easier to share a workflow
> with someone that doesn't have/want the requirements for a particular
> backend.

I agree, this would be convenient.

I’m not familiar with git annex.  Would you be interested in drafting
this feature, e.g. by writing a patch or specifying how it should work
in detail?

>>>> * add support for executing processes in isolated environments
>>>>   (containers) — this requires a better understanding of process inputs.
>
> [...]
>
>> This means that it can map file systems into the container and then run
>> the process expression in that environment.
>>
>> One thing I’m not happy about is that I can only mount directories, and
>> not individual files that have been declared as inputs.  I’d like to
>> have more fine-grained access.
>
> Right, limiting to the declared files makes sense.
>
> With `docker run', you can give files to -v:
>
> % ls /tmp/ | wc -l
> 121
> % file /tmp/scratch
> /tmp/scratch: ASCII text
> % docker run -it --rm -v /tmp/scratch:/tmp/scratch busybox ls /tmp
> scratch
>
> It looks like using files works for `guix environment' too, which makes
> me think that call-with-container can handle receiving files in MOUNT.
>
> % guix environment -C --ad-hoc coreutils -- ls /tmp | wc -l
> 0
> % guix environment -C --expose=/tmp/scratch=/tmp/scratch --ad-hoc coreutils -- ls /tmp
> scratch

Oh, neat.  I’ll give this a try later.  Thanks!

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 20:29           ` Ricardo Wurmus
@ 2019-06-07  4:11           ` Kyle Meyer
  0 siblings, 0 replies; 16+ messages in thread
From: Kyle Meyer @ 2019-06-07 4:11 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: gwl-devel

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> Kyle Meyer <kyle@kyleam.com> writes:
>
>> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:
>>
>>>> One of the things I'd love to do with GWL is to make it play well
>>>> with git-annex, something that would almost certainly be too
>>>> specific for GWL itself.
>>
>> [...]
>>
>>> I wonder what the protocol should look like.  Should a workflow
>>> explicitly request a “git annex” file or should it be up to the person
>>> running the workflow, i.e. when “git annex” has been configured to be
>>> the cache backend it would simply look up the declared input/output
>>> files there.
>>
>> The latter is what I had in mind.  One benefit I see of leaving it up to
>> the configured backend is that it makes it easier to share a workflow
>> with someone that doesn't have/want the requirements for a particular
>> backend.
>
> I agree, this would be convenient.
>
> I’m not familiar with git annex.  Would you be interested in drafting
> this feature, e.g. by writing a patch or specifying how it should work
> in detail?

Sure, I'll work on putting a patch together so there's something more
concrete to discuss.

^ permalink raw reply	[flat|nested] 16+ messages in thread
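The configured-backend idea discussed in this exchange could be dispatched at run time along these lines. This is only a rough illustration: the backend names, the environment variable, and the behavior are invented here, not an actual GWL interface.

```shell
# Hypothetical runtime dispatch for a cache backend: the workflow only
# declares files; the configured backend decides how inputs are fetched.
# With no backend configured, files are expected to be present locally.

GWL_CACHE_BACKEND="${GWL_CACHE_BACKEND:-none}"

fetch_input() {
    case "$GWL_CACHE_BACKEND" in
        git-annex)
            # A git-annex backend would fetch the annexed content,
            # e.g. by running: git annex get "$1"
            echo "git-annex: get $1" ;;
        *)
            # Default: use the local file as-is.
            echo "local: $1" ;;
    esac
}

fetch_input "/data/sample.fastq"
```

With this shape, sharing a workflow with someone who lacks git-annex only requires leaving the backend unconfigured.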
* Re: Next steps for the GWL
  2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
  2019-06-03 15:16 ` zimoun
  2019-06-06  3:19 ` Kyle Meyer
@ 2019-06-12  9:46 ` Ricardo Wurmus
  2 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-12 9:46 UTC (permalink / raw)
To: gwl-devel

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> My goals for the future (in no particular order) are as follows:
>
> * add support for running workflows from a file (without
>   GUIX_WORKFLOW_PATH)

This is now implemented.

> * add support for executing processes in isolated environments
>   (containers) — this requires a better understanding of process inputs.

A primitive version of this is also implemented now.  Every generated
script supports containerization when GWL_CONTAINERIZE is set.  (I
don’t expect users to set this manually, but to have a “driver” script
that sets it according to user configurations.)

This is one of the use cases that needs to be understood better.  I
would like different execution backends to be available in the
generated job scripts without having to make this decision at
preparation time.  I want developers to be able to distribute workflow
artifacts that are flexible enough to execute the workflow in different
ways, so I think container support must be switchable at runtime.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
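The runtime switch described here might, very roughly, take this shape in a generated script. This is a sketch under assumptions: the second variable, the exposed path, and the payload are all illustrative, not the actual GWL-generated code.

```shell
#!/bin/sh
# Sketch of a job script whose container support is switchable at run
# time: when GWL_CONTAINERIZE is set, the script re-executes itself
# inside an isolated environment; otherwise the payload runs directly.

payload() {
    # The actual process commands would be inlined here.
    echo "running process payload"
}

if [ -n "${GWL_CONTAINERIZE:-}" ] && [ -z "${GWL_INSIDE_CONTAINER:-}" ]; then
    # Re-enter this same script inside a container, exposing only the
    # declared inputs (path made up for illustration).
    GWL_INSIDE_CONTAINER=1 exec guix environment -C \
        --expose=/data/inputs=/data/inputs \
        --ad-hoc coreutils -- sh "$0"
fi

payload
```

Because the decision is read from the environment at execution time rather than baked in at preparation time, the same artifact can run plainly on one machine and containerized on another.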
end of thread, other threads:[~2019-06-12  9:46 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
2019-06-03 15:16 ` zimoun
2019-06-03 16:18   ` Ricardo Wurmus
2019-06-06 11:07     ` zimoun
2019-06-06 12:19       ` Ricardo Wurmus
2019-06-06 13:23         ` Pjotr Prins
2019-06-06  3:19 ` Kyle Meyer
2019-06-06 10:11   ` Ricardo Wurmus
2019-06-06 10:55     ` zimoun
2019-06-06 11:59       ` Ricardo Wurmus
2019-06-06 13:44         ` Pjotr Prins
2019-06-06 14:06           ` Pjotr Prins
2019-06-06 15:07     ` Kyle Meyer
2019-06-06 20:29       ` Ricardo Wurmus
2019-06-07  4:11         ` Kyle Meyer
2019-06-12  9:46 ` Ricardo Wurmus
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).