* Managing data files in workflows @ 2021-03-25 9:57 Konrad Hinsen 2021-03-26 7:02 ` zimoun 2021-03-26 8:47 ` Ricardo Wurmus 0 siblings, 2 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-25 9:57 UTC (permalink / raw) To: gwl-devel Hi everyone, Coming from make-like workflow systems, I wonder how data files are best managed in a GWL workflow. GWL is clearly less file-centric than make (which is a Good Thing in my opinion), but on a first reading of the manual, it doesn't seem to care about files at all, except for auto-connect. A simple example:
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }

workflow influenza-incidence
  processes download
==================================================
This works fine the first time, but the second time it fails because the output file of the process already exists. This doesn't look very useful. The two behaviors I do see as potentially useful are 1) always replace the file, or 2) don't run the process if the output file already exists (as make would do by default). I can handle this in my bash code of course, but that becomes lengthy even for this trivial case:
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # {
    rm {{outputs}}
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }
==================================================
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # { test -f {{outputs}} || wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
==================================================
Is there a better solution? Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
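The two candidate behaviors can be made precise with a small sketch (illustrative Python only, not GWL code; the policy names are invented for this example):

```python
# Illustrative sketch only: the two ways a workflow engine could treat an
# already-existing output file.  Policy names are invented for this example.
import os

def should_run(output, policy):
    """Return True if the process producing `output` should execute."""
    if policy == "replace":        # behavior 1: always replace the file
        if os.path.exists(output):
            os.remove(output)      # clear the stale output first
        return True
    if policy == "skip":           # behavior 2: make's default
        return not os.path.exists(output)
    raise ValueError(f"unknown policy: {policy}")
```

Under the "skip" policy, deleting the output file by hand is enough to force a re-download, which is the make-style workaround described above.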
* Re: Managing data files in workflows 2021-03-25 9:57 Managing data files in workflows Konrad Hinsen @ 2021-03-26 7:02 ` zimoun 2021-03-26 12:46 ` Konrad Hinsen 2021-03-26 8:47 ` Ricardo Wurmus 1 sibling, 1 reply; 15+ messages in thread From: zimoun @ 2021-03-26 7:02 UTC (permalink / raw) To: Konrad Hinsen, gwl-devel Hi Konrad, It does not answer your concrete question but instead opens a new one. :-) Well, I never finished this draft; maybe it is worth discussing 1. how to deal with data? 2. on what does the workflow trigger a recomputation? Cheers, simon -------------------- Start of forwarded message -------------------- Hi, The recent features of the Guix Workflow Language [1] are really neat! The end-to-end paper by Ludo [2] is also really cool! For the online Guix Day back in December, it would have been cool to be able to distribute the videos via a channel. Or it could be cool to have all the talk materials [3] in a channel. But a package is not the right abstraction here. First, a “data” item can have multiple sources; second, data can be really large; and third, data are not always readable as source and do not have an output; data are a kind of fixed output. (Code is data but data is not code. :-)) Note that data is already fetched via packages, see ’r-bsgenome-hsapiens-ucsc-hg19’ or ’r-bsgenome-hsapiens-ucsc-hg38’ (’guix import’ reports ~677.3MiB and ’guix size’ reports ~748.0 MiB). I am not speaking about these. If I may, let me take the example of Lars’s talk from Guix Day: <https://www.psycharchives.org/handle/20.500.12034/3938> There are two parts: the video itself and the slides. Both are part of the same work. Another example is Konrad’s paper: <https://dx.doi.org/10.1063/1.5054887> with the paper and the supplementary material (code+data). For these two examples, ’package’ with some tweaks could be used. But for the data I deal with at work, the /gnu/store is not designed for that. 
To fix ideas, consider a (large) genomics study: say 100 patients with 0.5–10 GB of data each, plus the genomics reference, which means a couple more GB. At work these days we do not have too many new genomics projects; say three projects in parallel. I let you do the arithmetic. ;-) There are three levels: 1- the methods for fetching: URL (http or ftp), Git, IPFS, Dat, etc. 2- the record representing a “data” item 3- how to effectively store it locally and deal with it. Whether it makes sense for a ’data’ to be an input of a ’package’, and conversely, is an open question. A long time ago, with GWL folks we discussed a “backend”, such as git-annex or something else; from my understanding, it would answer #3, and the protocols git-annex accepts would answer #1. That leaves #2. In my project, I would like to have 3 files: a manifest describing which tools, channels describing at which version, and data describing how to fetch the data. Then I have the tools to work reproducibly: I can apply a workflow (GWL, my custom Python script, etc.). 1: <https://guixwl.org/> 2: <https://hpc.guix.info/blog/2020/06/reproducible-research-articles-from-source-code-to-pdf/> 3: <https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/talks> Cheers, simon -------------------- End of forwarded message -------------------- ^ permalink raw reply [flat|nested] 15+ messages in thread
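As a sketch of what level 2 might look like, here is a hypothetical record in Python (all names are invented for illustration; neither Guix nor GWL defines such a type today):

```python
# Hypothetical sketch of level 2, the record describing a "data" item.
# Field names are invented; the sha256 value is a placeholder, since the
# upstream file changes over time and a real record would pin a snapshot.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    name: str      # human-readable label
    method: str    # level 1: "http", "ftp", "git", "ipfs", "dat", ...
    location: str  # URL, commit, or content address
    sha256: str    # expected content hash, as in a fixed-output derivation

flu_incidence = DataSource(
    name="weekly-incidence",
    method="http",
    location="http://www.sentiweb.fr/datasets/incidence-PAY-3.csv",
    sha256="<placeholder: hash of the snapshot actually fetched>",
)
```

A frozen record keeps the description immutable, leaving level 3 (local storage) to whatever backend resolves the record to a file.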
* Re: Managing data files in workflows 2021-03-26 7:02 ` zimoun @ 2021-03-26 12:46 ` Konrad Hinsen 0 siblings, 0 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:46 UTC (permalink / raw) To: zimoun, gwl-devel Hi Simon, > It does not answer your concrete question but instead opens a new one. :-) And a good one! > 1. how to deal with data? > 2. on what does the workflow trigger a recomputation? Number 2 was what I had in mind with my question. And I still wonder how GWL handles it now and/or in the near future. > There are three levels: > 1- the methods for fetching: URL (http or ftp), Git, IPFS, Dat, etc. > 2- the record representing a “data” item > 3- how to effectively store it locally and deal with it > Whether it makes sense for a ’data’ to be an input of a ’package’, and conversely, is an open question. > A long time ago, with GWL folks we discussed a “backend”, such as git-annex or something else; from my understanding, it would answer #3, and the protocols git-annex accepts would answer #1. That leaves #2. Perhaps a good first step is to actually use git-annex for big files, and then integrate it more and more into Guix and/or GWL. Multiple backends will certainly be required in the near future, because data storage is not yet sufficiently standardized to pick one specific technology. So why not profit from the work already done in git-annex? One answer to #2 would be to use a git repository managed by git-annex, with remotes pointing to the repositories that actually hold the data. Not very elegant, but as a first step, why not. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-25 9:57 Managing data files in workflows Konrad Hinsen 2021-03-26 7:02 ` zimoun @ 2021-03-26 8:47 ` Ricardo Wurmus 2021-03-26 12:30 ` Konrad Hinsen 1 sibling, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-03-26 8:47 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Hi Konrad, > Coming from make-like workflow systems, I wonder how data files are best managed in a GWL workflow. GWL is clearly less file-centric than make (which is a Good Thing in my opinion), but on a first reading of the manual, it doesn't seem to care about files at all, except for auto-connect. > A simple example:
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
>   # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
>
> workflow influenza-incidence
>   processes download
> ==================================================
This works for me correctly:
--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
  /gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
  /gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv
building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[ 2/ 3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c /gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 09:41:17-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incidence.csv [ <=> ] 83.50K --.-KB/s in 0.05s

2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$
--8<---------------cut here---------------end--------------->8---
Here’s the changed workflow:
--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---
It skips the process because the output file exists; the daring assumption we make is that outputs are reproducible. I would like to make these assumptions explicit in a future version, but I’m not sure how. An idea is to add keyword arguments to “file” that allow us to provide a content hash, or merely a flag to declare a file as volatile and thus in need of recomputation. 
I also wanted to have IPFS and git-annex support, but before I embark on this I want to understand exactly how this should behave and what the UI should be. E.g. having an input that is declared as “IPFS-file” would cause that input file to be fetched automatically without having to specify a process that downloads it first. (Something similar could be implemented for web resources as in your example.) -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 8:47 ` Ricardo Wurmus @ 2021-03-26 12:30 ` Konrad Hinsen 2021-03-26 12:54 ` Konrad Hinsen 2021-03-26 13:13 ` Ricardo Wurmus 0 siblings, 2 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:30 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, Ricardo Wurmus <rekado@elephly.net> writes: > This works for me correctly: Thanks for looking into this! For me, your change makes no difference. Nor should it, because in my setup the "data" directory already exists. I still get an error message about the already existing file. Maybe it's time to switch to the development version of GWL! > It skips the process because the output file exists and the daring assumption we make is that outputs are reproducible. > I would like to make these assumptions explicit in a future version, but I’m not sure how. An idea is to add keyword arguments to “file” that allow us to provide a content hash, or merely a flag to declare a file as volatile and thus in need of recomputation. Declaring a file explicitly as volatile or reproducible sounds good. I am less convinced about adding a hash, except for inputs external to the workflow. In my example, the file I download changes on the server once per week, so I'd mark it as volatile. I'd then expect it to be re-downloaded at every execution of the workflow. But I am also OK with doing this manually, i.e. deleting the file if I want it to be replaced. Old make habits never die ;-) > I also wanted to have IPFS and git-annex support, but before I embark on this I want to understand exactly how this should behave and what the UI should be. E.g. having an input that is declared as “IPFS-file” would cause that input file to be fetched automatically without having to specify a process that downloads it first. (Something similar could be implemented for web resources as in your example.) Indeed. An extended version of "guix download" for workflows. 
However, what I had in mind with my question is the management of intermediate results in my workflow, especially in its development phase. If I change my workflow file, or a script that it calls, I'd want only the affected steps to be recomputed. That's not much of an issue for my current test case, but I have bigger dreams for the future ;-) Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 12:30 ` Konrad Hinsen @ 2021-03-26 12:54 ` Konrad Hinsen 0 siblings, 0 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:54 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Maybe it's time to switch to the development version of GWL! Not as obvious as I thought: it doesn't build.
$ guix build -L channel -f .guix.scm
starts out saying
The following derivations will be built:
  /gnu/store/b62d2v6210p8j27fcx7z08xb3lcjw5hi-gwl-0.3.0.drv
  /gnu/store/prx66i4jvs445g82gkc5sv7p7hhf27ba-guile-lib-0.2.7.drv
building /gnu/store/prx66i4jvs445g82gkc5sv7p7hhf27ba-guile-lib-0.2.7.drv...
and many lines later fails with
/tmp/guix-build-guile-lib-0.2.7.drv-0/guile-lib-0.2.7/build-aux/missing: line 81: makeinfo: command not found
WARNING: 'makeinfo' is missing on your system.
Note that
$ guix build --check guile-lib
works just fine, so the failure must somehow involve the additional channel. Unfortunately I don't understand the guile-replacing magic in there well enough for any serious debugging. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 12:30 ` Konrad Hinsen 2021-03-26 12:54 ` Konrad Hinsen @ 2021-03-26 13:13 ` Ricardo Wurmus 2021-03-26 15:36 ` Konrad Hinsen 1 sibling, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-03-26 13:13 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: >> This works for me correctly: > Thanks for looking into this! For me, your change makes no difference. Nor should it, because in my setup the "data" directory already exists. I still get an error message about the already existing file. > Maybe it's time to switch to the development version of GWL! Hmm, I don’t see any commits since 0.3.0 that would affect the cache implementation. GWL computes cache hashes for all processes and the processes they depend on. In your case it’s trivial: there’s just one process. The process definition is hashed and looked up in the cache to see if there is any output for the given process hash. In my test case this file exists: /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv /tmp/gwl is the cache prefix, and the hash corresponds to the process. Since data/weekly-incidence.csv exists and that’s the only declared output, GWL decides not to compute the output again. At least that happens in my case. I wonder why it doesn’t work in your case. > However, what I had in mind with my question is the management of intermediate results in my workflow, especially in its development phase. If I change my workflow file, or a script that it calls, I'd want only the affected steps to be recomputed. That's not much of an issue for my current test case, but I have bigger dreams for the future ;-) Yes, that’s the way it’s supposed to work already. GWL computes the hashes of each chain of processes, which includes the generated process script, its inputs, and the hashes of all processes that lead up to this process. 
Any change in the chain will lead to a new hash and thus a cache miss, leading GWL to recompute. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
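The hash-chaining scheme described above can be sketched in a few lines (illustrative Python only; GWL's actual implementation is in Guile and hashes the generated process scripts):

```python
# Rough sketch of cache-key computation: a process's key covers its
# generated script and the keys of all processes it depends on, so any
# upstream change propagates to every downstream key.
import hashlib

def process_key(script, dependency_keys=()):
    h = hashlib.sha256()
    h.update(script.encode("utf-8"))
    for key in sorted(dependency_keys):  # order-independent combination
        h.update(key.encode("utf-8"))
    return h.hexdigest()

# A two-step chain: a download feeding a plotting step (invented scripts).
download_key = process_key('wget -O data.csv http://example.org/data.csv')
plot_key = process_key('gnuplot plot.gp', [download_key])
```

Rerunning an unchanged chain reproduces the same keys (a cache hit), while editing the download script changes the plot step's key as well, forcing recomputation downstream.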
* Re: Managing data files in workflows 2021-03-26 13:13 ` Ricardo Wurmus @ 2021-03-26 15:36 ` Konrad Hinsen 2021-04-01 13:27 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 15:36 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Ricardo Wurmus <rekado@elephly.net> writes: > In my test case this file exists: > /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv And that's the one that causes the error message for me when I run the workflow a second time (see below). But as I understand now, the mistake happens earlier, as this step shouldn't be executed at all. > At least that happens in my case. I wonder why it doesn’t work in your case. Is there anything I can do to debug this? > Yes, that’s the way it’s supposed to work already. GWL computes the hashes of each chain of processes, which includes the generated process script, its inputs, and the hashes of all processes that lead up to this process. Any change in the chain will lead to a new hash and thus a cache miss, leading GWL to recompute. Excellent, that's what I was hoping to happen, given the Guix background. Cheers, Konrad.
$ guix workflow run test.w
info: Loading workflow file `test.w'...
info: Computing workflow `influenza-incidence'...
run: Executing: /bin/sh -c /gnu/store/km977swwhqj2n1mg34fq6sv4a41iabkm-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 13:55:22-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incid [ <=> ] 83,50K --.-KB/s in 0,008s

2021-03-26 13:55:22 (10,5 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

Backtrace:
6 (primitive-load "/home/hinsen/.config/guix/current/bin/guix")
In guix/ui.scm:
2164:12 5 (run-guix-command _ . _)
In srfi/srfi-1.scm:
460:18 4 (fold #<procedure 7f33df87ea80 at gwl/workflows.scm:388:2 (ite…> …)
460:18 3 (fold #<procedure 7f33df87ea60 at gwl/workflows.scm:391:13 (pr…> …)
In gwl/workflows.scm:
392:21 2 (_ #<process download> ())
In srfi/srfi-1.scm:
634:9 1 (for-each #<procedure 7f33df87e460 at gwl/workflows.scm:589:28…> …)
In guix/ui.scm:
566:4 0 (_ system-error "symlink" _ _ _)
guix/ui.scm:566:4: In procedure symlink:
File exists: "/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/./data/weekly-incidence.csv"
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 15:36 ` Konrad Hinsen @ 2021-04-01 13:27 ` Ricardo Wurmus 2021-04-02 8:41 ` Konrad Hinsen 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-04-01 13:27 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Hi Konrad, > Ricardo Wurmus <rekado@elephly.net> writes: > >> In my test case this file exists: >> >> /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv > > And that's the one that causes the error message for me when I run the > workflow a second time (see below). But as I understand now, the mistake > happens earlier, as this step shouldn't be executed at all. > >> At least that happens in my case. I wonder why it doesn’t work in your >> case. > > Is there anything I can do to debug this? Maybe. You could run with “--dry-run” to see what GWL claims it would do to confirm that it considers the file to be “not cached”. Also enable more log events (in particular cache events) with “--log-events=error,info,execute,cache,debug” The backtrace makes it seem that caching the downloaded file fails. That’s surprising, because (@ (gwl cache) cache!) will delete an existing file in the cache before linking a file to the cache prefix. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-04-01 13:27 ` Ricardo Wurmus @ 2021-04-02 8:41 ` Konrad Hinsen 2021-04-07 11:38 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-04-02 8:41 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, > Maybe. You could run with “--dry-run” to see what GWL claims it would do to confirm that it considers the file to be “not cached”. > Also enable more log events (in particular cache events) with “--log-events=error,info,execute,cache,debug” Thanks, I think I made progress with those nice debugging aids. When I run my workflow for the first time, I see
cache: Caching `./data/weekly-incidence.csv' as `/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/./data/weekly-incidence.csv'
The '.' in there looks suspect. Let's see what I got:
$ ls -lR /tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra
/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra:
total 4
drwxrwxr-x 2 hinsen hinsen 4096 2 avril 10:13 data

/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/data:
total 0
lrwxrwxrwx 1 hinsen hinsen 27 2 avril 10:13 weekly-incidence.csv -> ./data/weekly-incidence.csv
That's an invalid symbolic link, so it's not surprising that a second run doesn't find the cached file. When I use an absolute filename to refer to my download target, the symlink in the cache is valid and points to the downloaded file. And when I run the workflow a second time, it skips the "download" process as expected. But then, it fails trying to "restore" the file:
run: Skipping process "download" (cached at /tmp/gwl/ubvscxwoezl63qmvyfszlf6azmuc655h7gbbtosqshlm5r6ckyhq/).
cache: Restoring `/tmp/gwl/ubvscxwoezl63qmvyfszlf6azmuc655h7gbbtosqshlm5r6ckyhq//home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv' to `/home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv'
Backtrace:
6 (primitive-load "/home/hinsen/.config/guix/current/bin/guix")
In guix/ui.scm:
2164:12 5 (run-guix-command _ . _)
In srfi/srfi-1.scm:
460:18 4 (fold #<procedure 7f45ba1d5c40 at gwl/workflows.scm:388:2 (ite…> …)
460:18 3 (fold #<procedure 7f45ba1d5c20 at gwl/workflows.scm:391:13 (pr…> …)
In gwl/workflows.scm:
392:21 2 (_ #<process download> ())
In srfi/srfi-1.scm:
634:9 1 (for-each #<procedure 7f45ba1d57e0 at gwl/workflows.scm:549:26…> …)
In guix/ui.scm:
566:4 0 (_ system-error "symlink" _ _ _)
guix/ui.scm:566:4: In procedure symlink:
Operation not permitted: "/home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv"
Looking at the source code in (gwl cache), restoring means symlinking the target file to the cached file, which can't work given that the cache is already a symlink to the target file. So... I don't understand how the cache is supposed to work. If it stores symlinks, there is no need to restore anything. If it is supposed to store copies, then that's not what it does. My original issue with the relative filename is a detail that should be easy to fix. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
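The dangling link is easy to reproduce outside GWL: a symlink with a relative target is resolved relative to the directory containing the link, not the directory that was current when the link was created. A Python sketch (the directory names are stand-ins for the working directory and /tmp/gwl/<hash>):

```python
# Reproduction of the dangling-link problem: storing a *relative* symlink
# target under the cache prefix makes the link resolve inside the cache
# directory, where the file does not exist.
import os
import tempfile

workdir = tempfile.mkdtemp()   # where the workflow actually ran
cachedir = tempfile.mkdtemp()  # stands in for /tmp/gwl/<hash>

os.makedirs(os.path.join(workdir, "data"))
target = os.path.join(workdir, "data", "weekly-incidence.csv")
with open(target, "w") as f:
    f.write("week,incidence\n")

os.makedirs(os.path.join(cachedir, "data"))
link = os.path.join(cachedir, "data", "weekly-incidence.csv")
# The relative target, exactly as shown by `ls -lR` above:
os.symlink("./data/weekly-incidence.csv", link)
```

Linking against the absolute path instead produces a valid symlink, matching the behavior observed when the download target is given as an absolute filename.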
* Re: Managing data files in workflows 2021-04-02 8:41 ` Konrad Hinsen @ 2021-04-07 11:38 ` Ricardo Wurmus 2021-04-08 7:28 ` Konrad Hinsen 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-04-07 11:38 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Looking at the source code in (gwl cache), restoring means symlinking the target file to the cached file, which can't work given that the cache is already a symlink to the target file. > So... I don't understand how the cache is supposed to work. If it stores symlinks, there is no need to restore anything. If it is supposed to store copies, then that's not what it does. Right, that’s really the heart of the problem here. Originally, I used hardlinks exclusively, but they don’t work everywhere. So I added symlinks, but obviously they have different semantics. We can fix the problem with symlinks by restoring the target of the link instead of the link itself, but I feel that we need to take a step back and consider what this cache is really to be used for. The cache assumes again that files are immutable when really they are not guaranteed to be immutable. Neither symlinks nor hardlinks give us any guarantees. I really would like to have independent copies of input and output files, but I also don’t want to needlessly copy files around or use up more space than absolutely necessary. We could punt on the problem of optimal space consumption and simply copy files to the cache. What do you think? -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
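The copy-based cache proposed here could look roughly like this (an illustrative Python sketch with invented function names, not GWL's API):

```python
# Sketch of a copy-based cache: store an independent copy of each declared
# output, restore by copying back.  Space-hungry, but the semantics are
# unambiguous: cache and working copy can diverge without corrupting each
# other, unlike a symlink or hardlink.
import os
import shutil

def cache_store(workdir, cachedir, relpath):
    """Copy an output file from the working directory into the cache."""
    dst = os.path.join(cachedir, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(workdir, relpath), dst)

def cache_restore(workdir, cachedir, relpath):
    """Copy a cached file back into the working directory."""
    dst = os.path.join(workdir, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(cachedir, relpath), dst)
```

The design trade-off is exactly the one raised in the message: copying makes restore semantics trivial at the cost of duplicated storage, while links save space at the cost of aliasing.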
* Re: Managing data files in workflows 2021-04-07 11:38 ` Ricardo Wurmus @ 2021-04-08 7:28 ` Konrad Hinsen 2021-05-03 9:18 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-04-08 7:28 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, > We can fix the problem with symlinks by restoring the target of the link > instead of the link itself, but I feel that we need to take a step back > and consider what this cache is really to be used for. Indeed, and I have to admit that this isn't clear to me yet. What is it supposed to protect against? Modification of files by other processes of the workflow? Modification of files outside of the workflow? Both? For the second situation (modification outside of the workflow), I think it would be sufficient to store a checksum, and terminate the workflow with an error if it detects such tampering. The first situation is more difficult. There are actually two cases: 1. The workflow intentionally updates files as it proceeds. 2. The workflow modifies a file by mistake. Only the workflow author can make the distinction, so this needs some specific input syntax. Case 2 could then again be handled by a simple checksum test for signalling an error. This leaves case 1, for which the only good solution is to make a copy of the file at the end of each process, and restore it in later runs. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
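The checksum test suggested above for detecting modification outside the workflow can be sketched as follows (illustrative Python only; the state-file layout is invented):

```python
# Sketch: record each file's hash in a state file on first sight, and
# fail loudly when a later run sees different content.
import hashlib
import json
import os

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_unmodified(path, statefile):
    """Record `path`'s hash on first sight; raise if it changed since."""
    state = {}
    if os.path.exists(statefile):
        with open(statefile) as f:
            state = json.load(f)
    digest = sha256_of(path)
    if state.get(path) not in (None, digest):
        raise RuntimeError(f"{path} was modified outside the workflow")
    state[path] = digest
    with open(statefile, "w") as f:
        json.dump(state, f)
```

For case 1 (intentional updates), this check would simply be skipped for files the author declares as updated by the workflow, which is where a copy-and-restore mechanism takes over.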
* Re: Managing data files in workflows 2021-04-08 7:28 ` Konrad Hinsen @ 2021-05-03 9:18 ` Ricardo Wurmus 2021-05-03 11:58 ` zimoun 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-05-03 9:18 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Hi Ricardo, >> We can fix the problem with symlinks by restoring the target of the link instead of the link itself, but I feel that we need to take a step back and consider what this cache is really to be used for. > Indeed, and I have to admit that this isn't clear to me yet. What is it supposed to protect against? Modification of files by other processes of the workflow? Modification of files outside of the workflow? Both? > For the second situation (modification outside of the workflow), I think it would be sufficient to store a checksum, and terminate the workflow with an error if it detects such tampering. > The first situation is more difficult. There are actually two cases: 1. The workflow intentionally updates files as it proceeds. 2. The workflow modifies a file by mistake. > Only the workflow author can make the distinction, so this needs some specific input syntax. Case 2 could then again be handled by a simple checksum test for signalling an error. > This leaves case 1, for which the only good solution is to make a copy of the file at the end of each process, and restore it in later runs. Yes, you are right. On wip-drmaa I changed the cache to never symlink. It either hardlinks or copies. This solves the immediate problem. Yes, the semantics of hardlink/copy differ, but since our assumption is that intermediate files are reproducible, we can ignore this at this point. I want to make the cache store/restore actions configurable, though, so that you can implement whatever caching method you want (including caching by copying to AWS S3). 
I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-05-03 9:18 ` Ricardo Wurmus @ 2021-05-03 11:58 ` zimoun 2021-05-03 13:47 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: zimoun @ 2021-05-03 11:58 UTC (permalink / raw) To: Ricardo Wurmus, Konrad Hinsen; +Cc: gwl-devel Hi, On Mon, 03 May 2021 at 11:18, Ricardo Wurmus <rekado@elephly.net> wrote: > I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. Ah, maybe it is the same idea that we discussed some days ago on #guix-hpc. To me, everything should be “immutable” and stored “somewhere”. Somehow, you do not need to hash all the contents but only the input hashes and the ’process’ itself. And it is already done somehow for packages. The only expensive hash is the fixed-output one, i.e., the initial data input. Well, I have not sent my explanations because my picture of how GWL currently works is still a bit vague. And I have not done my homework yet. Needless to say, I need to re-read the previous discussions we had on the topic. :-) However, I do not see what “mutable” could be. Cheers, simon ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-05-03 11:58 ` zimoun @ 2021-05-03 13:47 ` Ricardo Wurmus 0 siblings, 0 replies; 15+ messages in thread From: Ricardo Wurmus @ 2021-05-03 13:47 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel zimoun <zimon.toutoune@gmail.com> writes: > Hi, > On Mon, 03 May 2021 at 11:18, Ricardo Wurmus <rekado@elephly.net> wrote: >> I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. > Ah, maybe it is the same idea that we discussed some days ago on #guix-hpc. > To me, everything should be “immutable” and stored “somewhere”. Somehow, you do not need to hash all the contents but only the input hashes and the ’process’ itself. This is in fact what we do now. Every process code snippet is hashed, and we compute the hashes of the chain of processes to determine whether to re-run something or not. > However, I do not see what “mutable” could be. It’s a very generous assumption that intermediate output files are reproducible. Often that’s not actually the case. Having a “mutable” modifier would allow us to mark an output file as not reproducible and thus always in need of recomputation. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread