unofficial mirror of gwl-devel@gnu.org
 help / color / mirror / Atom feed
* [PATCH] workflow: Consider unspecified free inputs when checking cache.
@ 2019-06-13  3:48 Kyle Meyer
  2019-06-24 13:54 ` Ricardo Wurmus
  0 siblings, 1 reply; 7+ messages in thread
From: Kyle Meyer @ 2019-06-13  3:48 UTC (permalink / raw)
  To: gwl-devel; +Cc: Kyle Meyer

When deciding whether a process is cached, we take into account the
modification time and size of free inputs, but only those that are
explicitly specified on the command line.  As a result, the same
updated input file that would invalidate the cache when given
explicitly to --input does _not_ invalidate the cache when not
specified but instead picked from the current working directory.

Correct this discrepancy by including unspecified free inputs in the
cache prefix calculation.

* gwl/workflows.scm (workflow-run): Pass unspecified free inputs to
make-process->cache-prefix.
---
Note: This is mostly code movement, so something like

  git show --color-moved=zebra --color-moved-ws=allow-indentation-change

might be helpful for reviewing.

 gwl/workflows.scm | 87 +++++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 40 deletions(-)

diff --git a/gwl/workflows.scm b/gwl/workflows.scm
index 50dc2ee..efdf089 100644
--- a/gwl/workflows.scm
+++ b/gwl/workflows.scm
@@ -262,46 +262,51 @@ (define* (workflow-run workflow engine
                 inputs)
       (() '())
       (mapping mapping)))
+
+  (define-values (input-names input-files)
+    (match inputs-map
+      (() (values '() '()))
+      (_ (apply values
+                (apply zip inputs-map)))))
+
+  (define unspecified-inputs
+    (lset-difference equal?
+                     (workflow-free-inputs workflow)
+                     input-names))
+
   (define (inputs-valid?)
-    (let-values (((input-names input-files)
-                  (match inputs-map
-                    (() (values '() '()))
-                    (_ (apply values
-                              (apply zip inputs-map))))))
-      (match (lset-difference equal?
-                              (workflow-free-inputs workflow)
-                              input-names)
-        (()
-         ;; verify input files
-         (match (filter (negate file-exists?) input-files)
-           (()
-            ;; Link all mapped input files to their target locations
-            ;; TODO: ensure that target directories exist.
-            (unless (null? inputs-map)
-              (for-each (match-lambda
-                          ((target source)
-                           (unless (file-exists? target)
-                             (link source target))))
-                        inputs-map))
-            #t)
-           (missing
-            (format (current-error-port)
-                    "Missing files: ~{~%  * ~a~}.~%"
-                    missing)
-            #f)))
-        (missing
-         ;; Try to find the files in the environment.
-         ;; XXX Tell user that we pick the files from the current
-         ;; working directory.
-         ;; XXX These files would need to be mapped into the
-         ;; container.
-         (let* ((found (filter file-exists? missing))
-               (really-missing (lset-difference equal? missing found)))
-           (or (null? really-missing)
-               (begin (format (current-error-port)
-                              "Missing inputs: ~{~%  * ~a~}.~%Provide them with --input=NAME=FILE.~%"
-                              really-missing)
-                      #f)))))))
+    (match unspecified-inputs
+      (()
+       ;; verify input files
+       (match (filter (negate file-exists?) input-files)
+         (()
+          ;; Link all mapped input files to their target locations
+          ;; TODO: ensure that target directories exist.
+          (unless (null? inputs-map)
+            (for-each (match-lambda
+                        ((target source)
+                         (unless (file-exists? target)
+                           (link source target))))
+                      inputs-map))
+          #t)
+         (missing
+          (format (current-error-port)
+                  "Missing files: ~{~%  * ~a~}.~%"
+                  missing)
+          #f)))
+      (missing
+       ;; Try to find the files in the environment.
+       ;; XXX Tell user that we pick the files from the current
+       ;; working directory.
+       ;; XXX These files would need to be mapped into the
+       ;; container.
+       (let* ((found (filter file-exists? missing))
+              (really-missing (lset-difference equal? missing found)))
+         (or (null? really-missing)
+             (begin (format (current-error-port)
+                            "Missing inputs: ~{~%  * ~a~}.~%Provide them with --input=NAME=FILE.~%"
+                            really-missing)
+                    #f))))))
   (define ordered-processes
     (workflow-run-order workflow #:parallel? parallel?))
   (define (run)
@@ -309,7 +314,9 @@ (define* (workflow-run workflow engine
           (runner (process-engine-runner engine)))
       (define process->cache-prefix
         (make-process->cache-prefix workflow
-                                    inputs-map
+                                    (append inputs-map
+                                            (map (lambda (x) (list x x))
+                                                 unspecified-inputs))
                                     ordered-processes
                                     engine))
       (define cached?
-- 
2.22.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-13  3:48 [PATCH] workflow: Consider unspecified free inputs when checking cache Kyle Meyer
@ 2019-06-24 13:54 ` Ricardo Wurmus
  2019-06-25  4:30   ` Kyle Meyer
  0 siblings, 1 reply; 7+ messages in thread
From: Ricardo Wurmus @ 2019-06-24 13:54 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: gwl-devel


Hi Kyle,

> When deciding whether a process is cached, we take into account the
> modification time and size of free inputs, but only those that are
> explicitly specified on the command line.  As a result, the same
> updated input file that would invalidate the cache when given
> explicitly to --input does _not_ invalidate the cache when not
> specified but instead picked from the current working directory.
>
> Correct this discrepancy by including unspecified free inputs in the
> cache prefix calculation.

Thank you for this patch and my apologies for the delay in reviewing the
change!  It took me a little too long to understand the problem and then
even longer to decide whether we should include this workaround or if
there was an alternative that could improve the handling of unspecified
inputs all together.

I ended up splitting it up into two commits: one that names the values,
the other to pass on the unspecified inputs to
make-process->cache-prefix.  I’m not sure if we should keep picking
inputs from the environment silently and by default, but your patch is
anyway more correct than what we had before.

Thanks again!

--
Ricardo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-24 13:54 ` Ricardo Wurmus
@ 2019-06-25  4:30   ` Kyle Meyer
  2019-06-25 17:33     ` zimoun
  0 siblings, 1 reply; 7+ messages in thread
From: Kyle Meyer @ 2019-06-25  4:30 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: gwl-devel

Ricardo Wurmus <rekado@elephly.net> writes:

[...]

> Thank you for this patch and my apologies for the delay in reviewing the
> change!

No problem at all (especially since I haven't yet found the free time to
work on cache backend stuff discussed in id:87lfyeox06.fsf@kyleam.com
:x).

> I’m not sure if we should keep picking
> inputs from the environment silently and by default, but your patch is
> anyway more correct than what we had before.

Hmm, for my use case, taking free inputs from the file system based on
the current directory is the only method that I'm actually interested in
(i.e. I don't see myself having any use for --input).  Perhaps my
thinking is too shaped by make/snakemake, and I don't fully grasp the
approach GWL is trying to take.

-- 
Kyle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-25  4:30   ` Kyle Meyer
@ 2019-06-25 17:33     ` zimoun
  2019-06-25 18:33       ` Ricardo Wurmus
  2019-06-26  1:31       ` Kyle Meyer
  0 siblings, 2 replies; 7+ messages in thread
From: zimoun @ 2019-06-25 17:33 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: Ricardo Wurmus, gwl-devel

Hi,

On Tue, 25 Jun 2019 at 06:30, Kyle Meyer <kyle@kyleam.com> wrote:
>
> Ricardo Wurmus <rekado@elephly.net> writes:

> > I’m not sure if we should keep picking
> > inputs from the environment silently and by default, but your patch is
> > anyway more correct than what we had before.
>
> Hmm, for my use case, taking free inputs from the file system based on
> the current directory is the only method that I'm actually interested in
> (i.e. I don't see myself having any use for --input).  Perhaps my
> thinking is too shaped by make/snakemake, and I don't fully grasp the
> approach GWL is trying to take.

I am not sure to fully understand the issue and all the recent changes.

One idea of GWL is to have a functional workflow: the
multi-composition of functions/processes. And free inputs
are--say--the argument of this function. Therefore, if you have many
samples and you need to apply the same workflow, then you just apply
the function to each sample with --input. I mean it is my
understanding of the approach. Maybe I have wrong...


All the best,
simon

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-25 17:33     ` zimoun
@ 2019-06-25 18:33       ` Ricardo Wurmus
  2019-06-26  1:31       ` Kyle Meyer
  1 sibling, 0 replies; 7+ messages in thread
From: Ricardo Wurmus @ 2019-06-25 18:33 UTC (permalink / raw)
  To: zimoun; +Cc: Kyle Meyer, gwl-devel


zimoun <zimon.toutoune@gmail.com> writes:

> On Tue, 25 Jun 2019 at 06:30, Kyle Meyer <kyle@kyleam.com> wrote:
>>
>> Ricardo Wurmus <rekado@elephly.net> writes:
>
>> > I’m not sure if we should keep picking
>> > inputs from the environment silently and by default, but your patch is
>> > anyway more correct than what we had before.
>>
>> Hmm, for my use case, taking free inputs from the file system based on
>> the current directory is the only method that I'm actually interested in
>> (i.e. I don't see myself having any use for --input).  Perhaps my
>> thinking is too shaped by make/snakemake, and I don't fully grasp the
>> approach GWL is trying to take.
>
> I am not sure to fully understand the issue and all the recent changes.
>
> One idea of GWL is to have a functional workflow: the
> multi-composition of functions/processes. And free inputs
> are--say--the argument of this function. Therefore, if you have many
> samples and you need to apply the same workflow, then you just apply
> the function to each sample with --input.

That’s right.

It’s also used for when you have a shared stash of input files
(e.g. genomes) and you simply want to inform the GWL about where those
files are and that they correspond to inputs in the workflow — without
having to copy them first.

For a complicated workflow with lots of inputs specifying inputs can be
really tedious.  For this reason the GWL will also pick up appropriately
named files in the current working directory.

My only concern is whether this dual behaviour should be the default or
if it should be switchable.  (E.g. passing “--pure” would force the user
to specify all inputs with “--input”.)

--
Ricardo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-25 17:33     ` zimoun
  2019-06-25 18:33       ` Ricardo Wurmus
@ 2019-06-26  1:31       ` Kyle Meyer
  2019-06-26  7:05         ` Ricardo Wurmus
  1 sibling, 1 reply; 7+ messages in thread
From: Kyle Meyer @ 2019-06-26  1:31 UTC (permalink / raw)
  To: zimoun; +Cc: Ricardo Wurmus, gwl-devel

Hi simon,

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> On Tue, 25 Jun 2019 at 06:30, Kyle Meyer <kyle@kyleam.com> wrote:
>>
>> Ricardo Wurmus <rekado@elephly.net> writes:
>
>> > I’m not sure if we should keep picking
>> > inputs from the environment silently and by default, but your patch is
>> > anyway more correct than what we had before.
>>
>> Hmm, for my use case, taking free inputs from the file system based on
>> the current directory is the only method that I'm actually interested in
>> (i.e. I don't see myself having any use for --input).  Perhaps my
>> thinking is too shaped by make/snakemake, and I don't fully grasp the
>> approach GWL is trying to take.
>
> I am not sure to fully understand the issue and all the recent changes.

No need to understand the patch :]  The quoted discussion is pretty
tangential to the issue resolved by the patch.  It's only related in
that it's about unspecified free inputs.

> One idea of GWL is to have a functional workflow: the
> multi-composition of functions/processes. And free inputs
> are--say--the argument of this function. Therefore, if you have many
> samples and you need to apply the same workflow, then you just apply
> the function to each sample with --input. I mean it is my
> understanding of the approach. Maybe I have wrong...

Thanks, that matches my understanding.

To expand on my original comment that --input doesn't fit my desired use
case: I want the workflow to be fully specified.  For me, this
translates to

  (1) all scripts that aren't included in Guix packages are tracked in a
      Git repository

  (2) software dependencies are completely specified via a manifest and
      Guix inferiors

  (3) all input data files are tracked in the repository (via git-annex)

  (4) the workflow itself is tracked in the repository

  (5) the workflow unambiguously specifies how to generate each output.
      I don't want to have to document that "to generate THIS, you
      should run `guix workflow --input=THIS=THAT'".  I want to make a
      blanket statement that "you can run `<workflow cmd> TARGET' to
      generate any desired output, which will do whatever is needed to
      make that happen".  (I know this isn't currently possible with
      GWL.)

Of course none of that is to say above is _the_ way to do it; I'm just
elaborating on what my use case and focus is.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] workflow: Consider unspecified free inputs when checking cache.
  2019-06-26  1:31       ` Kyle Meyer
@ 2019-06-26  7:05         ` Ricardo Wurmus
  0 siblings, 0 replies; 7+ messages in thread
From: Ricardo Wurmus @ 2019-06-26  7:05 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: zimoun, gwl-devel


Hi Kyle,

>   (5) the workflow unambiguously specifies how to generate each output.
>       I don't want to have to document that "to generate THIS, you
>       should run `guix workflow --input=THIS=THAT'".  I want to make a
>       blanket statement that "you can run `<workflow cmd> TARGET' to
>       generate any desired output, which will do whatever is needed to
>       make that happen".  (I know this isn't currently possible with
>       GWL.)

Spoken like a Snakemake user :)

The GWL has no fleshed out concept of targets, at least not in the way
that GNU Make presents them to the user.  The GWL’s primary abstraction
is the process.

However, we should be able to tell the GWL to only run those processes
that are required to reach a certain output.  This would require a
procedure that sections the graph of workflow processes.

--
Ricardo

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-06-26  7:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-13  3:48 [PATCH] workflow: Consider unspecified free inputs when checking cache Kyle Meyer
2019-06-24 13:54 ` Ricardo Wurmus
2019-06-25  4:30   ` Kyle Meyer
2019-06-25 17:33     ` zimoun
2019-06-25 18:33       ` Ricardo Wurmus
2019-06-26  1:31       ` Kyle Meyer
2019-06-26  7:05         ` Ricardo Wurmus

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).