* Next steps for the GWL @ 2019-05-29 13:47 Ricardo Wurmus 2019-06-03 15:16 ` zimoun ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Ricardo Wurmus @ 2019-05-29 13:47 UTC (permalink / raw) To: gwl-devel Hi, I’m going to use the GWL in the next few days to rewrite the PiGx RNAseq pipeline from Snakemake. This will likely show me what features are still missing from the GWL and what implemented features are awkward to use. My goals for the future (in no particular order) are as follows: * add support for running workflows from a file (without GUIX_WORKFLOW_PATH) * replace the “qsub”-based Grid Engine execution engine with proper DRMAA bindings * add support for running remote jobs via Guile SSH * add support for spawning jobs on AWS or Azure or via Kubernetes * tighter integration with Guix features, e.g. to export a container image per process via “guix pack” or to pack up the whole workflow as a relocatable executable. * inversion of control: enable workflow designers to use the GWL as a library, so that the “guix workflow” user interface does not need to be used at all (see PiGx for an example). * explore the use of inferiors — the GWL should be usable with any version of Guix that may be installed, not just the version that was used at compilation time. Can we use “guix repl” and inferiors, perhaps? * add support for executing processes in isolated environments (containers) — this requires a better understanding of process inputs. What do you think about this? Who would like to help flesh out these ideas? -- Ricardo ^ permalink raw reply [flat|nested] 16+ messages in thread
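[Editorial note: the “inferiors” bullet above can be sketched with Guix’s existing (guix inferior) API. This is a minimal illustration only, not GWL code; the commit hash is a placeholder and the idea that the GWL would resolve packages through such an inferior is an assumption.]

```scheme
;; Sketch only: how the GWL might resolve packages through a pinned
;; Guix revision instead of the Guix it was compiled against.
;; The commit below is a placeholder, not a real revision.
(use-modules (guix channels)
             (guix inferior)
             (srfi srfi-1))

(define pinned-channels
  (list (channel
         (name 'guix)
         (url "https://git.savannah.gnu.org/git/guix.git")
         (commit "0000000000000000000000000000000000000000")))) ; placeholder

;; Spawns (or reuses) an inferior Guix for these channels.
(define inferior
  (inferior-for-channels pinned-channels))

;; Look up a package object in the pinned Guix; a workflow process
;; could then use this instead of a package from the host Guix.
(define samtools
  (first (lookup-inferior-packages inferior "samtools")))
```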
* Re: Next steps for the GWL
  2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
@ 2019-06-03 15:16 ` zimoun
  2019-06-03 16:18 ` Ricardo Wurmus
  2019-06-06  3:19 ` Kyle Meyer
  2019-06-12  9:46 ` Ricardo Wurmus
  2 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2019-06-03 15:16 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: gwl-devel

Hi Ricardo,

On Wed, 29 May 2019 at 16:41, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:
>
> I’m going to use the GWL in the next few days to rewrite the PiGx RNAseq
> pipeline from Snakemake.  This will likely show me what features are
> still missing from the GWL and what implemented features are awkward to
> use.

Awesome!

> * tighter integration with Guix features, e.g. to export a container
>   image per process via “guix pack” or to pack up the whole workflow as
>   a relocatable executable.

Yes!  Awesome.  Relocatable tarballs.  Docker images.  Singularity ones.

And maybe generate one pack (docker) per process and something to glue
them together, e.g., http://www.genouest.org/godocker/

> * explore the use of inferiors — the GWL should be usable with any
>   version of Guix that may be installed, not just the version that was
>   used at compilation time.  Can we use “guix repl” and inferiors,
>   perhaps?

For reproducibility, a Guix commit should be provided and a `guix pull`
(inferiors) should be used.  For example, the output of `guix describe
-f channels` should be used, either with an option or directly in the
Scheme/Wisp workflow file with a new keyword.

> * add support for executing processes in isolated environments
>   (containers) — this requires a better understanding of process inputs.

Maybe this is the same story as the GoDocker above.

Talking about ideas:
 - what about the Content Addressable Store?
 - what about a bridge with CWL?

Thank you for reviving the list. :-)

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
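[Editorial note: the pinning zimoun describes starts from the output of `guix describe -f channels`, which is itself a Scheme channels specification. The fragment below shows that format with a placeholder commit; the idea of a GWL keyword or option consuming such a file is an assumption.]

```scheme
;; channels.scm, as emitted by “guix describe -f channels”
;; (the commit hash here is a placeholder).  Saving this file pins the
;; exact Guix revision; a future GWL option or workflow keyword could
;; point at such a file to reproduce the environment.
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit "0000000000000000000000000000000000000000")))
```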
* Re: Next steps for the GWL
  2019-06-03 15:16 ` zimoun
@ 2019-06-03 16:18 ` Ricardo Wurmus
  2019-06-06 11:07   ` zimoun
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-03 16:18 UTC (permalink / raw)
  To: zimoun; +Cc: gwl-devel

Hi simon,

>> * tighter integration with Guix features, e.g. to export a container
>>   image per process via “guix pack” or to pack up the whole workflow as
>>   a relocatable executable.
>
> Yes!  Awesome.
> Relocatable tarballs.  Docker images.  Singularity ones.
>
> And maybe generate one pack (docker) per process and something to glue
> them together, e.g.,
> http://www.genouest.org/godocker/

Generating one “container image” per process is a desirable goal (even
though it seems a little wasteful).  I don’t know how godocker fits into
this.  The home page says:

    It is a batch computing/cluster management tool using Docker as
    execution/isolation system.  It can be seen like Sun Grid
    Engine/Torque/...  The software does not manage however itself the
    dispatch of the commands on the remote nodes.  For this, it
    integrates with container management tools (Docker Swarm, Apache
    Mesos, ...)  It acts as an additional layer above those tools on
    multiple user systems where users do not have Docker privileges or
    knowledge.

Can we directly support these container management tools?  I’d like to
make GWL workflows very portable, so that there are only few runtime
requirements.  Depending on a cluster management tool to be configured
would be counter to this goal.

> Talking about ideas:
> - what about the Content Addressable Store?

This already exists, but I’m not sure it’s sufficient.

> - what about a bridge with CWL?

I’m open to this idea, but it would need to be well-defined.  What does
it really mean?  Generating CWL files from GWL workflows?  That really
shouldn’t be too hard.  Anything else, however, is hard for me to
imagine.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-03 16:18 ` Ricardo Wurmus
@ 2019-06-06 11:07   ` zimoun
  2019-06-06 12:19     ` Ricardo Wurmus
  0 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2019-06-06 11:07 UTC (permalink / raw)
  To: Ricardo Wurmus, Pjotr Prins; +Cc: gwl-devel

Hi,

(+ Pjotr because I am sure he has an interesting opinion but I am not
sure he closely reads this list ;-)

On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:

>> - what about a bridge with CWL?
>
> I’m open to this idea, but it would need to be well-defined.  What does
> it really mean?  Generating CWL files from GWL workflows?  That really
> shouldn’t be too hard.  Anything else, however, is hard for me to
> imagine.

Well, let me point out previous threads about this topic:

https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html
https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html

1- Generating CWL from GWL would be nice.  It should ease the use of
platforms and tools already in place (AWS, etc.)

2- Use CWL as a process.  A lot of work has been done by Pjotr and
reported here [1]

[1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 11:07 ` zimoun
@ 2019-06-06 12:19 ` Ricardo Wurmus
  2019-06-06 13:23   ` Pjotr Prins
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 12:19 UTC (permalink / raw)
  To: zimoun; +Cc: Pjotr Prins, gwl-devel

Hi simon,

> (+ Pjotr because I am sure he has an interesting opinion but not sure
> he closely reads this list ;-)
>
> On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus
> <ricardo.wurmus@mdc-berlin.de> wrote:
>
>> > - what about a bridge with CWL?
>>
>> I’m open to this idea, but it would need to be well-defined.  What does
>> it really mean?  Generating CWL files from GWL workflows?  That really
>> shouldn’t be too hard.  Anything else, however, is hard for me to
>> imagine.
>
> Well, I point out previous threads about this topic:
>
> https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html
> https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html
>
> 1-
> Generating CWL from GWL should be nice.  It should ease the use of
> already in-place platform and tools (AWS, etc.)

Generating CWL from GWL should be easy, but it’s also not all that
useful.  The GWL takes care of software deployment, so not only should
we generate CWL files but also generate (and upload?) Docker images and
make the CWL file reference them.

The tooling for CWL… seems a little less substantial and focused than it
first appears.  The cwltool can only run CWL workflows locally — no
DRMAA, no AWS.  All the other runners that are listed on the CWL website
are either very limited or very large environments where CWL execution
is not necessarily the primary purpose (cf Galaxy or Arvados).

Still, I think it’s the most meaningful connection the GWL can have with
the CWL: using the GWL as a high-level representation which “compiles”
down to a lower-level representation of CWL + Docker images when needed.

> 2-
> Use CWL as a process.  A lot of work has been done by Pjotr and
> reported here [1]
>
> [1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/

Yes, this works, of course, but that’s a level of integration that’s
extremely limited, in my opinion.  Using Guix with the CWL is fine as
the blog post demonstrates, but there is very little to be gained and
much to be lost when embedding CWL in a GWL workflow.  The only thing
this enables is reusing existing CWL workflows as a GWL “process”.
There is no meaningful integration – the embedded CWL workflow is a
second-class citizen that cannot benefit from any of the GWL features.

If the CWL workflow is connected to the GWL via cwltool then the only
way to run the workflow on a DRMAA-supported cluster, a bunch of
SSH-connected servers, or AWS EC2 instances is to wrap it up in a GWL
context.  The GWL treats the process as its smallest unit of
organisation, so a CWL workflow that’s run as a GWL process cannot
really be scaled.  If the user has a different CWL execution environment
(such as an Arvados installation), the CWL workflow embedded in the GWL
will not be able to make use of it.  It would forever be tied to the
particular version of cwltool in Guix.

I’d rather not advocate this use of the CWL in the GWL.  It might sound
good (“The GWL is compatible with the CWL!”), but ultimately it’s a
really awkward connection that is bound to lead to a great deal of
frustration.

Does this make sense?  I don’t want to be dismissive.  It would be great
if we could come up with something that’s mutually beneficial for CWL
users and GWL users alike, but I feel that our options are very limited.
I’m still open to ideas and use-case scenarios.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
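[Editorial note: the “compile down to CWL + Docker images” idea above might produce output roughly like the CWL fragment below. This is a hypothetical sketch: the tool, image tag, and file names are invented for illustration, and only the CWL v1.0 syntax itself (CommandLineTool, DockerRequirement) is real.]

```yaml
# Hypothetical result of exporting one GWL process as a CWL
# CommandLineTool, referencing an image built with “guix pack -f docker”.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, index]
requirements:
  DockerRequirement:
    dockerPull: example.org/gwl/samtools-process:1.9   # hypothetical image
inputs:
  bam:
    type: File
    inputBinding: {position: 1}
outputs:
  index:
    type: File
    outputBinding: {glob: "*.bai"}
```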
* Re: Next steps for the GWL 2019-06-06 12:19 ` Ricardo Wurmus @ 2019-06-06 13:23 ` Pjotr Prins 0 siblings, 0 replies; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 13:23 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel On Thu, Jun 06, 2019 at 02:19:04PM +0200, Ricardo Wurmus wrote: > > Hi simon, > > > (+ Pjotr because I am sure he has an interesting opinion but not sure > > he closely reads this list ;-) I read it :) > > On Mon, 3 Jun 2019 at 18:18, Ricardo Wurmus > > <ricardo.wurmus@mdc-berlin.de> wrote: > > > >> > - what about a bridge with CWL? > >> > >> I’m open to this idea, but it would need to be well-defined. What does > >> it really mean? Generating CWL files from GWL workflows? That really > >> shouldn’t be too hard. Anything else, however, is hard for me to > >> imagine. > > > > Well, I point out previous threads about this topic: > > > > https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00428.html > > https://lists.gnu.org/archive/html/gwl-devel/2019-02/msg00019.html > > > > 1- > > Generating CWL from GWL should be nice. It should ease the use of > > already in-place platform and tools (AWS, etc.) > > Generating CWL from GWL should be easy, but it’s also not all that > useful. The GWL takes care of software deployment, so not only should > we generate CWL files but also generate (and upload?) Docker images and > make the CWL file reference them. > > The tooling for CWL… seems a little less substantial and focused than it > first appears. The cwltool can only run CWL workflows locally — no > DRMAA, no AWS. All the other runners that are listed on the CWL website > are either very limited or very large environments where CWL execution > is not necessarily the primary purpose (cf Galaxy or Arvados). > > Still, I think it’s the most meanigful connection the GWL can have with > the CWL: using the GWL as a high-level representation which “compiles” > down to a lower-level representation of CWL + Docker images when needed. 
> > > 2- > > Use CWL as a process. A lot of work have been done by Pjotr and > > reported here [1] > > > > > > [1] https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/ > > Yes, this works, of course, but that’s a level of integration that’s > extremely limited, in my opinion. Using Guix with the CWL is fine as > the blog post demonstrates, but there is very little to be gained and > much to be lost when embedding CWL in a GWL workflow. The only thing > this enables is reusing existing CWL workflows as a GWL “process”. > There is no meaningful integration – the embedded CWL workflow is a > second-class citizen that cannot benefit from any of the GWL features. > > If the CWL workflow is connected to the GWL via cwltool then the only > way to run the workflow on a DRMAA-supported cluster or a bunch of > SSH-connected servers, or AWS EC2 instances is to wrap it up in a GWL > context. The GWL treats the process as its smallest unit of > organisation, so a CWL workflow that’s run as a GWL process cannot > really be scaled. If the user has a different CWL execution environment > (such as an Arvados installation), the CWL workflow embedded in the GWL > will not be able to make use of it. It would forever be tied to the > particular version of cwltool in Guix. > > I’d rather not advocate this use of the CWL in the GWL. It might sound > good (“The GWL is compatible with the CWL!”), but ultimately it’s a > really awkward connection that is bound to lead to a great deal of > frustration. > > Does this make sense? Yes. Personally I also think the CWL is flawed. It overcomplicates things and the reference implementation is pretty crappy. If we get GWL to work in my environment I would think it a breath of fresh air. Not to say that the CWL does not have some bad ideas (triple negative). You can read my blog for that. > I don’t want to be dismissive. 
It would be great if we could come
> up with something that’s mutually beneficial for CWL users and GWL
> users alike, but I feel that our options are very limited.  I’m still
> open to ideas and use-case scenarios.

We can probably just mix the two.  I mean the main benefit of the CWL is
*sharing* workflows that have been described by others.  That is the
point of the CWL and even at that it has not proven really great (after
all this time, how much is shared?).  Since CWL and GWL can use the same
file system and job submission system I think it is pretty OK for GWL to
ignore the CWL and either send data from one to the other or execute CWL
pipelines from GWL.  Both are possible without much work.

Pj.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus 2019-06-03 15:16 ` zimoun @ 2019-06-06 3:19 ` Kyle Meyer 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-12 9:46 ` Ricardo Wurmus 2 siblings, 1 reply; 16+ messages in thread From: Kyle Meyer @ 2019-06-06 3:19 UTC (permalink / raw) To: Ricardo Wurmus, gwl-devel Hi Ricardo, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes: > My goals for the future (in no particular order) are as follows: Thanks for sharing. Looks very exciting! > [...] > > * inversion of control: enable workflow designers to use the GWL as a > library, so that the “guix workflow” user interface does not need to > be used at all (see PiGx for an example). Sounds like a good direction to go, and I imagine it'd facilitate building wrappers that extend GWL. One of the things I'd love to do with GWL is to make it play well with git-annex, something that would almost certainly be too specific for GWL itself. For example * Make data caching git-annex aware. When deciding to recompute data files, GWL avoids computing the hash of data files, using scripts as the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. But if the user is tracking data files with git-annex, getting the hash of data files becomes less expensive because we can ask git-annex for the hash it has already computed. * Support getting annex data files on demand (i.e. 'git annex get') if they are needed as inputs. > * explore the use of inferiors — the GWL should be usable with any > version of Guix that may be installed, not just the version that was > used at compilation time. Can we use “guix repl” and inferiors, > perhaps? For my personal use, I'd almost always want to pin an analysis workflow at a certain Guix version, so making it easy to use inferiors in the workflow would be great. > * add support for executing processes in isolated environments > (containers) — this requires a better understanding of process inputs. 
This is another one I'm especially excited about. Functionality-wise, are you imagining essentially matching the options available for 'guix environment --container ...'? -- Kyle ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 3:19 ` Kyle Meyer @ 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-06 10:55 ` zimoun 2019-06-06 15:07 ` Kyle Meyer 0 siblings, 2 replies; 16+ messages in thread From: Ricardo Wurmus @ 2019-06-06 10:11 UTC (permalink / raw) To: Kyle Meyer; +Cc: gwl-devel Hi Kyle, thanks for your comments! > One of the things I'd love to do > with GWL is to make it play well with git-annex, something that would > almost certainly be too specific for GWL itself. For example > > * Make data caching git-annex aware. When deciding to recompute data > files, GWL avoids computing the hash of data files, using scripts as > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > But if the user is tracking data files with git-annex, getting the > hash of data files becomes less expensive because we can ask > git-annex for the hash it has already computed. > > * Support getting annex data files on demand (i.e. 'git annex get') if > they are needed as inputs. I wonder what the protocol should look like. Should a workflow explicitly request a “git annex” file or should it be up to the person running the workflow, i.e. when “git annex” has been configured to be the cache backend it would simply look up the declared input/output files there. I suppose the answers would equally apply to using IPFS as a cache. >> * add support for executing processes in isolated environments >> (containers) — this requires a better understanding of process inputs. > > This is another one I'm especially excited about. Functionality-wise, > are you imagining essentially matching the options available for 'guix > environment --container ...'? 
So far this is all I’ve got: --8<---------------cut here---------------start------------->8--- diff --git a/gwl/processes.scm b/gwl/processes.scm index beb61cc..264807f 100644 --- a/gwl/processes.scm +++ b/gwl/processes.scm @@ -19,13 +19,19 @@ #:use-module ((guix derivations) #:select (derivation->output-path build-derivations)) + #:use-module ((guix packages) + #:select (package-file)) #:use-module (guix gexp) - #:use-module ((guix monads) #:select (mlet return)) + #:use-module ((guix monads) #:select (mlet mapm return)) #:use-module (guix records) #:use-module ((guix store) #:select (run-with-store with-store %store-monad)) + #:use-module ((guix modules) + #:select (source-module-closure)) + #:use-module (gnu system file-systems) + #:use-module (gnu build linux-container) #:use-module (ice-9 format) #:use-module (ice-9 match) #:use-module (srfi srfi-1) @@ -276,6 +282,54 @@ plain S-expression." (call process code))) (whatever (error (format #f "unsupported procedure: ~a\n" whatever))))) +;; WIP +(define (containerize exp process) + "Wrap EXP, an S-expression or G-expression, in a G-expression that +causes EXP to be run in a container according to the requirements +specified in PROCESS." + (let* ((package-dirs + (with-store store + (run-with-store store + (mapm %store-monad package-file + (process-package-inputs process))))) + (data-inputs + (process-data-inputs process)) + (output-dirs + (delete-duplicates + (map dirname (process-outputs process)))) + (input-mappings + (map (lambda (location) + (file-system-mapping + (source location) + (target location) + (writable? #f))) + (lset-difference string=? + (append package-dirs + data-inputs) + output-dirs))) + (output-mappings + (map (lambda (dir) + (file-system-mapping + (source dir) + (target dir) + (writable? 
#t))) + output-dirs)) + (specs + (map (compose file-system->spec + file-system-mapping->bind-mount) + (append input-mappings + output-mappings)))) + (with-imported-modules (source-module-closure + '((gnu build linux-container) + (gnu system file-systems))) + #~(begin + (use-modules (gnu build linux-container) + (gnu system file-systems)) + (call-with-container (append %container-file-systems + (map spec->file-system + '#$specs)) + (lambda () #$exp)))))) + ;;; --------------------------------------------------------------------------- ;;; ADDITIONAL FUNCTIONS ;;; --------------------------------------------------------------------------- --8<---------------cut here---------------end--------------->8--- This means that it can map file systems into the container and then run the process expression in that environment. One thing I’m not happy about is that I can only mount directories, and not individual files that have been declared as inputs. I’d like to have more fine-grained access. I suppose it might be possible to mount just the relevant parts of the GWL cache, but I need to play with this to better understand what the desired behaviour would be. -- Ricardo ^ permalink raw reply related [flat|nested] 16+ messages in thread
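[Editorial note: on the “I can only mount directories” point above — Guix’s file-system mappings can in principle target single files as well, which is how `guix environment --expose` handles file arguments. A minimal sketch, with a hypothetical input path:]

```scheme
;; Sketch: mapping one declared input *file* (rather than its whole
;; directory) read-only into the container.  The path is hypothetical.
(use-modules (gnu system file-systems))

(define input-mapping
  (file-system-mapping
   (source "/data/sample.fastq")   ; hypothetical declared input
   (target "/data/sample.fastq")
   (writable? #f)))
```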
* Re: Next steps for the GWL
  2019-06-06 10:11 ` Ricardo Wurmus
@ 2019-06-06 10:55   ` zimoun
  2019-06-06 11:59     ` Ricardo Wurmus
  2019-06-06 13:44     ` Pjotr Prins
  1 sibling, 2 replies; 16+ messages in thread
From: zimoun @ 2019-06-06 10:55 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: Kyle Meyer, gwl-devel

Hi,

On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> wrote:

> > One of the things I'd love to do
> > with GWL is to make it play well with git-annex, something that would
> > almost certainly be too specific for GWL itself.  For example
> >
> > * Make data caching git-annex aware.  When deciding to recompute data
> >   files, GWL avoids computing the hash of data files, using scripts as
> >   the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net.
> >   But if the user is tracking data files with git-annex, getting the
> >   hash of data files becomes less expensive because we can ask
> >   git-annex for the hash it has already computed.
> >
> > * Support getting annex data files on demand (i.e. 'git annex get') if
> >   they are needed as inputs.
>
> I wonder what the protocol should look like.  Should a workflow
> explicitly request a “git annex” file or should it be up to the person
> running the workflow, i.e. when “git annex” has been configured to be
> the cache backend it would simply look up the declared input/output
> files there.
>
> I suppose the answers would equally apply to using IPFS as a cache.

I agree that a mechanism such as `git-annex` would be nice.
But is it not a means for the CAS that we previously discussed?

I fully agree with the features and their description.  Totally cool!
However, I am a bit reluctant about `git-annex` because it requires a
Haskell compiler and it is far far from "bootstrappability".  I am aware
of Ricardo's attempt---and AFAIK the only one.  And here [1] are
explanations by one Haskeller.

My opinion: GWL should stay on the path of Reproducibility, end-to-end.
So `git-annex` should be a transitional step---while the Haskell
bootstrap is not solved---as a means for the CAS (cache), and I would
find it more elegant to use the "data-oriented IPFS": IPLD [2].

[1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC
[2] https://ipld.io/

All the best,
simon

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 10:55   ` zimoun
@ 2019-06-06 11:59     ` Ricardo Wurmus
  0 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 11:59 UTC (permalink / raw)
  To: zimoun; +Cc: Kyle Meyer, gwl-devel

Hi simon,

> I agree that a mechanism such as `git-annex` would be nice.
> But is it not a means for the CAS that we previously discussed?

It does not need to be the *only* mechanism.  Multiple backends can
serve different users.

> I fully agree with the features and their description.  Totally cool!
> However, I am a bit reluctant about `git-annex` because it requires a
> Haskell compiler and it is far far from "bootstrappability".  I am aware
> of Ricardo's attempt---and AFAIK the only one.  And here [1] are
> explanations by one Haskeller.

This is off-topic, but I’m probably going to bite the bullet and simply
use GCC 2.x to build an old GHC 4.x from the C “source” files, which are
surprisingly close to actual source code.  I’ve tried to build GHC 4.x
with a recent compiler, but the code depends on too many quirks of GCC 2
that make it very hard to be sure about the behavior post migration.

(Re [1]: I talked to Joachim at one of the repro builds summits about
the GHC bootstrapping attempts, which prompted their blog post.)

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 10:55 ` zimoun 2019-06-06 11:59 ` Ricardo Wurmus @ 2019-06-06 13:44 ` Pjotr Prins 2019-06-06 14:06 ` Pjotr Prins 1 sibling, 1 reply; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 13:44 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel, Ricardo Wurmus IPFS is meant for data sharing and reproducibility. It also allows for private networks which is rather important. Scalability of IPFS is a concern, so either we cache using IPFS or we have some other caching mechanism. git-annex is too much of a hack in my book. It also does not scale that well. Pj. On Thu, Jun 06, 2019 at 12:55:52PM +0200, zimoun wrote: > Hi, > > On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus > <ricardo.wurmus@mdc-berlin.de> wrote: > > > > One of the things I'd love to do > > > with GWL is to make it play well with git-annex, something that would > > > almost certainly be too specific for GWL itself. For example > > > > > > * Make data caching git-annex aware. When deciding to recompute data > > > files, GWL avoids computing the hash of data files, using scripts as > > > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > > > But if the user is tracking data files with git-annex, getting the > > > hash of data files becomes less expensive because we can ask > > > git-annex for the hash it has already computed. > > > > > > * Support getting annex data files on demand (i.e. 'git annex get') if > > > they are needed as inputs. > > > > I wonder what the protocol should look like. Should a workflow > > explicitly request a “git annex” file or should it be up to the person > > running the workflow, i.e. when “git annex” has been configured to be > > the cache backend it would simply look up the declared input/output > > files there. > > > > I suppose the answers would equally apply to using IPFS as a cache. > > I agree that the mechanism such as `git-annex` should be nice. > But is it not a mean for the CAS that we previously discussed? 
> > I fully agree with the features and their description. Totally cool! > However, I am a bit reluctant with `git-annex` because it requires a > Haskell compiler and it is far far from "bootstrapability". I am aware > of the Ricardo's try---and AFIAK the only one. And here [1] > explanations by one Haskeller. > > My opinion: GWL should stay on the path of Reproducibility, > end-to-end. So `git-annex` should be a transitional step---while the > Haskell bootstrap is not solved---as a mean for the CAS (cache) and I > would find more elegant to use the "data-oriented IPFS": IPLD [2]. > > > [1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC > [2] https://ipld.io/ > > > All the best, > simon > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 13:44 ` Pjotr Prins @ 2019-06-06 14:06 ` Pjotr Prins 0 siblings, 0 replies; 16+ messages in thread From: Pjotr Prins @ 2019-06-06 14:06 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel, Ricardo Wurmus We should also assess this https://labs.eleks.com/2019/03/ipfs-network-data-replication.html On Thu, Jun 06, 2019 at 08:44:04AM -0500, Pjotr Prins wrote: > IPFS is meant for data sharing and reproducibility. It also allows for > private networks which is rather important. > > Scalability of IPFS is a concern, so either we cache using IPFS or we > have some other caching mechanism. > > git-annex is too much of a hack in my book. It also does not scale > that well. > > Pj. > > On Thu, Jun 06, 2019 at 12:55:52PM +0200, zimoun wrote: > > Hi, > > > > On Thu, 6 Jun 2019 at 12:11, Ricardo Wurmus > > <ricardo.wurmus@mdc-berlin.de> wrote: > > > > > > One of the things I'd love to do > > > > with GWL is to make it play well with git-annex, something that would > > > > almost certainly be too specific for GWL itself. For example > > > > > > > > * Make data caching git-annex aware. When deciding to recompute data > > > > files, GWL avoids computing the hash of data files, using scripts as > > > > the cheaper proxy, as you described in 87womnnjg0.fsf@elephly.net. > > > > But if the user is tracking data files with git-annex, getting the > > > > hash of data files becomes less expensive because we can ask > > > > git-annex for the hash it has already computed. > > > > > > > > * Support getting annex data files on demand (i.e. 'git annex get') if > > > > they are needed as inputs. > > > > > > I wonder what the protocol should look like. Should a workflow > > > explicitly request a “git annex” file or should it be up to the person > > > running the workflow, i.e. when “git annex” has been configured to be > > > the cache backend it would simply look up the declared input/output > > > files there. 
> > > > > > I suppose the answers would equally apply to using IPFS as a cache. > > > > I agree that the mechanism such as `git-annex` should be nice. > > But is it not a mean for the CAS that we previously discussed? > > > > I fully agree with the features and their description. Totally cool! > > However, I am a bit reluctant with `git-annex` because it requires a > > Haskell compiler and it is far far from "bootstrapability". I am aware > > of the Ricardo's try---and AFIAK the only one. And here [1] > > explanations by one Haskeller. > > > > My opinion: GWL should stay on the path of Reproducibility, > > end-to-end. So `git-annex` should be a transitional step---while the > > Haskell bootstrap is not solved---as a mean for the CAS (cache) and I > > would find more elegant to use the "data-oriented IPFS": IPLD [2]. > > > > > > [1] https://www.joachim-breitner.de/blog/748-Thoughts_on_bootstrapping_GHC > > [2] https://ipld.io/ > > > > > > All the best, > > simon > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Next steps for the GWL 2019-06-06 10:11 ` Ricardo Wurmus 2019-06-06 10:55 ` zimoun @ 2019-06-06 15:07 ` Kyle Meyer 2019-06-06 20:29 ` Ricardo Wurmus 1 sibling, 1 reply; 16+ messages in thread From: Kyle Meyer @ 2019-06-06 15:07 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes: >> One of the things I'd love to do >> with GWL is to make it play well with git-annex, something that would >> almost certainly be too specific for GWL itself. [...] > I wonder what the protocol should look like. Should a workflow > explicitly request a “git annex” file or should it be up to the person > running the workflow, i.e. when “git annex” has been configured to be > the cache backend it would simply look up the declared input/output > files there. The latter is what I had in mind. One benefit I see of leaving it up to the configured backend is that it makes it easier to share a workflow with someone that doesn't have/want the requirements for a particular backend. >>> * add support for executing processes in isolated environments >>> (containers) — this requires a better understanding of process inputs. [...] > This means that it can map file systems into the container and then run > the process expression in that environment. > > One thing I’m not happy about is that I can only mount directories, and > not individual files that have been declared as inputs. I’d like to > have more fine-grained access. Right, limiting to the declared files makes sense. With `docker run', you can give files to -v: % ls /tmp/ | wc -l 121 % file /tmp/scratch /tmp/scratch: ASCII text % docker run -it --rm -v /tmp/scratch:/tmp/scratch busybox ls /tmp scratch It looks like using files works for `guix environment' too, which makes me think that call-with-container can handle receiving files in MOUNT. 
% guix environment -C --ad-hoc coreutils -- ls /tmp | wc -l
0
% guix environment -C --expose=/tmp/scratch=/tmp/scratch --ad-hoc coreutils -- ls /tmp
scratch

^ permalink raw reply	[flat|nested] 16+ messages in thread
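Taken a step further, the per-file exposure shown in the transcript could be assembled mechanically from a process's declared inputs. A minimal shell sketch of that idea (the file names are invented for illustration, and the command is printed rather than executed so it can be inspected):

```shell
# Build one --expose flag per declared input file, so a container
# created with `guix environment -C` sees only what the process
# declared.  File names are made up for illustration.

declared_inputs="/tmp/scratch /tmp/genome.fa"

expose_flags=""
for f in $declared_inputs; do
    expose_flags="$expose_flags --expose=$f=$f"
done

# Print the resulting command line instead of exec'ing it.
echo guix environment -C$expose_flags --ad-hoc coreutils -- ls /tmp
```

A real driver would `exec` the constructed command instead of echoing it.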
* Re: Next steps for the GWL
  2019-06-06 15:07           ` Kyle Meyer
@ 2019-06-06 20:29           ` Ricardo Wurmus
  2019-06-07  4:11             ` Kyle Meyer
  0 siblings, 1 reply; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-06 20:29 UTC (permalink / raw)
To: Kyle Meyer; +Cc: gwl-devel

Kyle Meyer <kyle@kyleam.com> writes:

> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:
>
>>> One of the things I'd love to do with GWL is to make it play well
>>> with git-annex, something that would almost certainly be too specific
>>> for GWL itself.
>
> [...]
>
>> I wonder what the protocol should look like.  Should a workflow
>> explicitly request a “git annex” file or should it be up to the person
>> running the workflow, i.e. when “git annex” has been configured to be
>> the cache backend it would simply look up the declared input/output
>> files there.
>
> The latter is what I had in mind.  One benefit I see of leaving it up to
> the configured backend is that it makes it easier to share a workflow
> with someone that doesn't have/want the requirements for a particular
> backend.

I agree, this would be convenient.

I’m not familiar with git annex.  Would you be interested in drafting
this feature, e.g. by writing a patch or specifying how it should work
in detail?

>>>> * add support for executing processes in isolated environments
>>>>   (containers) — this requires a better understanding of process inputs.
>
> [...]
>
>> This means that it can map file systems into the container and then run
>> the process expression in that environment.
>>
>> One thing I’m not happy about is that I can only mount directories, and
>> not individual files that have been declared as inputs.  I’d like to
>> have more fine-grained access.
>
> Right, limiting to the declared files makes sense.
>
> With `docker run', you can give files to -v:
>
> % ls /tmp/ | wc -l
> 121
> % file /tmp/scratch
> /tmp/scratch: ASCII text
> % docker run -it --rm -v /tmp/scratch:/tmp/scratch busybox ls /tmp
> scratch
>
> It looks like using files works for `guix environment' too, which makes
> me think that call-with-container can handle receiving files in MOUNT.
>
> % guix environment -C --ad-hoc coreutils -- ls /tmp | wc -l
> 0
> % guix environment -C --expose=/tmp/scratch=/tmp/scratch --ad-hoc coreutils -- ls /tmp
> scratch

Oh, neat.  I’ll give this a try later.  Thanks!

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Next steps for the GWL
  2019-06-06 20:29           ` Ricardo Wurmus
@ 2019-06-07  4:11           ` Kyle Meyer
  0 siblings, 0 replies; 16+ messages in thread
From: Kyle Meyer @ 2019-06-07 4:11 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: gwl-devel

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> Kyle Meyer <kyle@kyleam.com> writes:
>
>> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:
>>
>>>> One of the things I'd love to do with GWL is to make it play well
>>>> with git-annex, something that would almost certainly be too
>>>> specific for GWL itself.
>>
>> [...]
>>
>>> I wonder what the protocol should look like.  Should a workflow
>>> explicitly request a “git annex” file or should it be up to the person
>>> running the workflow, i.e. when “git annex” has been configured to be
>>> the cache backend it would simply look up the declared input/output
>>> files there.
>>
>> The latter is what I had in mind.  One benefit I see of leaving it up to
>> the configured backend is that it makes it easier to share a workflow
>> with someone that doesn't have/want the requirements for a particular
>> backend.
>
> I agree, this would be convenient.
>
> I’m not familiar with git annex.  Would you be interested in drafting
> this feature, e.g. by writing a patch or specifying how it should work
> in detail?

Sure, I'll work on putting a patch together so there's something more
concrete to discuss.

^ permalink raw reply	[flat|nested] 16+ messages in thread
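The configured-backend idea discussed in this exchange could be dispatched at run time along these lines. This is only a rough illustration: the backend names, the environment variable, and the behavior are invented here, not an actual GWL interface.

```shell
# Hypothetical runtime dispatch for a cache backend: the workflow only
# declares files; the configured backend decides how inputs are fetched.
# With no backend configured, files are expected to be present locally.

GWL_CACHE_BACKEND="${GWL_CACHE_BACKEND:-none}"

fetch_input() {
    case "$GWL_CACHE_BACKEND" in
        git-annex)
            # A git-annex backend would fetch the annexed content,
            # e.g. by running: git annex get "$1"
            echo "git-annex: get $1" ;;
        *)
            # Default: use the local file as-is.
            echo "local: $1" ;;
    esac
}

fetch_input "/data/sample.fastq"
```

With this shape, sharing a workflow with someone who lacks git-annex only requires leaving the backend unconfigured.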
* Re: Next steps for the GWL
  2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
  2019-06-03 15:16 ` zimoun
  2019-06-06  3:19 ` Kyle Meyer
@ 2019-06-12  9:46 ` Ricardo Wurmus
  2 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2019-06-12 9:46 UTC (permalink / raw)
To: gwl-devel

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> My goals for the future (in no particular order) are as follows:
>
> * add support for running workflows from a file (without
>   GUIX_WORKFLOW_PATH)

This is now implemented.

> * add support for executing processes in isolated environments
>   (containers) — this requires a better understanding of process inputs.

A primitive version of this is also implemented now.  Every generated
script supports containerization when GWL_CONTAINERIZE is set.  (I
don’t expect users to set this manually, but to have a “driver” script
that sets it according to user configurations.)

This is one of the use cases that needs to be understood better.  I
would like different execution backends to be available in the
generated job scripts without having to make this decision at
preparation time.  I want developers to be able to distribute workflow
artifacts that are flexible enough to execute the workflow in different
ways, so I think container support must be switchable at runtime.

--
Ricardo

^ permalink raw reply	[flat|nested] 16+ messages in thread
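The runtime switch described here might, very roughly, take this shape in a generated script. This is a sketch under assumptions: the second variable, the exposed path, and the payload are all illustrative, not the actual GWL-generated code.

```shell
#!/bin/sh
# Sketch of a job script whose container support is switchable at run
# time: when GWL_CONTAINERIZE is set, the script re-executes itself
# inside an isolated environment; otherwise the payload runs directly.

payload() {
    # The actual process commands would be inlined here.
    echo "running process payload"
}

if [ -n "${GWL_CONTAINERIZE:-}" ] && [ -z "${GWL_INSIDE_CONTAINER:-}" ]; then
    # Re-enter this same script inside a container, exposing only the
    # declared inputs (path made up for illustration).
    GWL_INSIDE_CONTAINER=1 exec guix environment -C \
        --expose=/data/inputs=/data/inputs \
        --ad-hoc coreutils -- sh "$0"
fi

payload
```

Because the decision is read from the environment at execution time rather than baked in at preparation time, the same artifact can run plainly on one machine and containerized on another.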
end of thread, other threads:[~2019-06-12  9:46 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29 13:47 Next steps for the GWL Ricardo Wurmus
2019-06-03 15:16 ` zimoun
2019-06-03 16:18   ` Ricardo Wurmus
2019-06-06 11:07     ` zimoun
2019-06-06 12:19       ` Ricardo Wurmus
2019-06-06 13:23         ` Pjotr Prins
2019-06-06  3:19 ` Kyle Meyer
2019-06-06 10:11   ` Ricardo Wurmus
2019-06-06 10:55     ` zimoun
2019-06-06 11:59       ` Ricardo Wurmus
2019-06-06 13:44         ` Pjotr Prins
2019-06-06 14:06           ` Pjotr Prins
2019-06-06 15:07     ` Kyle Meyer
2019-06-06 20:29       ` Ricardo Wurmus
2019-06-07  4:11         ` Kyle Meyer
2019-06-12  9:46 ` Ricardo Wurmus
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).