* Managing data files in workflows @ 2021-03-25 9:57 Konrad Hinsen 2021-03-26 7:02 ` zimoun 2021-03-26 8:47 ` Ricardo Wurmus 0 siblings, 2 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-25 9:57 UTC (permalink / raw) To: gwl-devel Hi everyone, Coming from make-like workflow systems, I wonder how data files are best managed in a GWL workflow. GWL is clearly less file-centric than make (which is a Good Thing in my opinion), but on a first reading of the manual, it doesn't seem to care about files at all, except for auto-connect. A simple example:
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }

workflow influenza-incidence
  processes download
==================================================
This works fine the first time, but the second time it fails because the output file of the process already exists. This doesn't look very useful. The two behaviors I do see as potentially useful are 1) always replace the file, or 2) don't run the process if the output file already exists (as make would do by default). I can handle this in my bash code of course, but that becomes lengthy even for this trivial case:
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # {
    rm {{outputs}}
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }
==================================================
==================================================
process download
  packages "wget"
  outputs
    file "data/weekly-incidence.csv"
  # { test -f {{outputs}} || wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
==================================================
Is there a better solution? Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
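The two candidate behaviors can be made precise with a small sketch (illustrative Python only, not GWL code; the policy names are invented for this example):

```python
# Illustrative sketch only: the two ways a workflow engine could treat an
# already-existing output file.  Policy names are invented for this example.
import os

def should_run(output, policy):
    """Return True if the process producing `output` should execute."""
    if policy == "replace":        # behavior 1: always replace the file
        if os.path.exists(output):
            os.remove(output)      # clear the stale output first
        return True
    if policy == "skip":           # behavior 2: make's default
        return not os.path.exists(output)
    raise ValueError(f"unknown policy: {policy}")
```

Under the "skip" policy, deleting the output file by hand is enough to force a re-download, which is the make-style workaround described above.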
* Re: Managing data files in workflows 2021-03-25 9:57 Managing data files in workflows Konrad Hinsen @ 2021-03-26 7:02 ` zimoun 2021-03-26 12:46 ` Konrad Hinsen 2021-03-26 8:47 ` Ricardo Wurmus 1 sibling, 1 reply; 15+ messages in thread From: zimoun @ 2021-03-26 7:02 UTC (permalink / raw) To: Konrad Hinsen, gwl-devel Hi Konrad, It does not answer your concrete question but instead opens a new one. :-) Well, I never finished this draft; maybe it is worth discussing 1. how to deal with data? 2. on what does the workflow trigger a recomputation? Cheers, simon -------------------- Start of forwarded message -------------------- Hi, The recent features of the Guix Workflow Language [1] are really neat! The end-to-end paper by Ludo [2] is also really cool! For the online Guix Day back in December, it would have been cool to be able to distribute the videos via a channel. Or it could be cool to have all the talk materials [3] in a channel. But a package is not the right abstraction here. First, a “data” item can have multiple sources; second, data can be really large; and third, data are not always readable as source and do not have an output; data are a kind of fixed output. (Code is data but data is not code. :-)) Note that data is already fetched via packages, see ’r-bsgenome-hsapiens-ucsc-hg19’ or ’r-bsgenome-hsapiens-ucsc-hg38’ (’guix import’ reports ~677.3MiB and ’guix size’ reports ~748.0 MiB). I am not speaking about these. If I may, let me take the example of Lars’s talk from Guix Day: <https://www.psycharchives.org/handle/20.500.12034/3938> There are two parts: the video itself and the slides. Both are part of the same work. Another example is Konrad’s paper: <https://dx.doi.org/10.1063/1.5054887> with the paper and the supplementary material (code+data). For these two examples, ’package’ with some tweaks could be used. But for the data I deal with at work, the /gnu/store is not designed for that. 
To fix ideas, consider a (large) genomics study: say 100 patients with 0.5–10 GB of data each, plus the genomics reference, which means a couple more GB. At work these days we do not have too many new genomics projects; say three projects in parallel. I let you do the arithmetic. ;-) There are three levels: 1- the methods for fetching: URL (http or ftp), Git, IPFS, Dat, etc. 2- the record representing a “data” item 3- how to effectively store it locally and deal with it. Whether it makes sense for a ’data’ to be an input of a ’package’, and conversely, is an open question. A long time ago, with GWL folks we discussed a “backend”, such as git-annex or something else; from my understanding, it would answer #3, and the protocols git-annex accepts would answer #1. That leaves #2. In my project, I would like to have 3 files: a manifest describing which tools, channels describing at which version, and data describing how to fetch the data. Then I have the tools to work reproducibly: I can apply a workflow (GWL, my custom Python script, etc.). 1: <https://guixwl.org/> 2: <https://hpc.guix.info/blog/2020/06/reproducible-research-articles-from-source-code-to-pdf/> 3: <https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/talks> Cheers, simon -------------------- End of forwarded message -------------------- ^ permalink raw reply [flat|nested] 15+ messages in thread
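As a sketch of what level 2 might look like, here is a hypothetical record in Python (all names are invented for illustration; neither Guix nor GWL defines such a type today):

```python
# Hypothetical sketch of level 2, the record describing a "data" item.
# Field names are invented; the sha256 value is a placeholder, since the
# upstream file changes over time and a real record would pin a snapshot.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    name: str      # human-readable label
    method: str    # level 1: "http", "ftp", "git", "ipfs", "dat", ...
    location: str  # URL, commit, or content address
    sha256: str    # expected content hash, as in a fixed-output derivation

flu_incidence = DataSource(
    name="weekly-incidence",
    method="http",
    location="http://www.sentiweb.fr/datasets/incidence-PAY-3.csv",
    sha256="<placeholder: hash of the snapshot actually fetched>",
)
```

A frozen record keeps the description immutable, leaving level 3 (local storage) to whatever backend resolves the record to a file.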
* Re: Managing data files in workflows 2021-03-26 7:02 ` zimoun @ 2021-03-26 12:46 ` Konrad Hinsen 0 siblings, 0 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:46 UTC (permalink / raw) To: zimoun, gwl-devel Hi Simon, > It does not answer your concrete question but instead opens a new one. :-) And a good one! > 1. how to deal with data? > 2. on what does the workflow trigger a recomputation? Number 2 was what I had in mind with my question. And I still wonder how GWL handles it now and/or in the near future. > There are three levels: > 1- the methods for fetching: URL (http or ftp), Git, IPFS, Dat, etc. > 2- the record representing a “data” item > 3- how to effectively store it locally and deal with it > Whether it makes sense for a ’data’ to be an input of a ’package’, and conversely, is an open question. > A long time ago, with GWL folks we discussed a “backend”, such as git-annex or something else; from my understanding, it would answer #3, and the protocols git-annex accepts would answer #1. That leaves #2. Perhaps a good first step is to actually use git-annex for big files, and then integrate it more and more into Guix and/or GWL. Multiple backends will certainly be required in the near future, because data storage is not yet sufficiently standardized to pick one specific technology. So why not profit from the work already done in git-annex? One answer to #2 would be to use a git repository managed by git-annex, with remotes pointing to the repositories that actually hold the data. Not very elegant, but as a first step, why not. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-25 9:57 Managing data files in workflows Konrad Hinsen 2021-03-26 7:02 ` zimoun @ 2021-03-26 8:47 ` Ricardo Wurmus 2021-03-26 12:30 ` Konrad Hinsen 1 sibling, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-03-26 8:47 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Hi Konrad, > Coming from make-like workflow systems, I wonder how data files are best managed in a GWL workflow. GWL is clearly less file-centric than make (which is a Good Thing in my opinion), but on a first reading of the manual, it doesn't seem to care about files at all, except for auto-connect. > A simple example:
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
>   # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
>
> workflow influenza-incidence
>   processes download
> ==================================================
This works for me correctly:
--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
  /gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
  /gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv
building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[ 2/ 3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c /gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 09:41:17-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incidence.csv [ <=> ] 83.50K --.-KB/s in 0.05s

2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$
--8<---------------cut here---------------end--------------->8---
Here’s the changed workflow:
--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---
It skips the process because the output file exists; the daring assumption we make is that outputs are reproducible. I would like to make these assumptions explicit in a future version, but I’m not sure how. An idea is to add keyword arguments to “file” that allow us to provide a content hash, or merely a flag to declare a file as volatile and thus in need of recomputation. 
I also wanted to have IPFS and git-annex support, but before I embark on this I want to understand exactly how this should behave and what the UI should be. E.g. having an input that is declared as “IPFS-file” would cause that input file to be fetched automatically without having to specify a process that downloads it first. (Something similar could be implemented for web resources as in your example.) -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 8:47 ` Ricardo Wurmus @ 2021-03-26 12:30 ` Konrad Hinsen 2021-03-26 12:54 ` Konrad Hinsen 2021-03-26 13:13 ` Ricardo Wurmus 0 siblings, 2 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:30 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, Ricardo Wurmus <rekado@elephly.net> writes: > This works for me correctly: Thanks for looking into this! For me, your change makes no difference. Nor should it, because in my setup the "data" directory already exists. I still get an error message about the already existing file. Maybe it's time to switch to the development version of GWL! > It skips the process because the output file exists and the daring assumption we make is that outputs are reproducible. > I would like to make these assumptions explicit in a future version, but I’m not sure how. An idea is to add keyword arguments to “file” that allow us to provide a content hash, or merely a flag to declare a file as volatile and thus in need of recomputation. Declaring a file explicitly as volatile or reproducible sounds good. I am less convinced about adding a hash, except for inputs external to the workflow. In my example, the file I download changes on the server once per week, so I'd mark it as volatile. I'd then expect it to be re-downloaded at every execution of the workflow. But I am also OK with doing this manually, i.e. deleting the file if I want it to be replaced. Old make habits never die ;-) > I also wanted to have IPFS and git-annex support, but before I embark on this I want to understand exactly how this should behave and what the UI should be. E.g. having an input that is declared as “IPFS-file” would cause that input file to be fetched automatically without having to specify a process that downloads it first. (Something similar could be implemented for web resources as in your example.) Indeed. An extended version of "guix download" for workflows. 
However, what I had in mind with my question is the management of intermediate results in my workflow, especially in its development phase. If I change my workflow file, or a script that it calls, I'd want only the affected steps to be recomputed. That's not much of an issue for my current test case, but I have bigger dreams for the future ;-) Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 12:30 ` Konrad Hinsen @ 2021-03-26 12:54 ` Konrad Hinsen 0 siblings, 0 replies; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 12:54 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Maybe it's time to switch to the development version of GWL! Not as obvious as I thought: it doesn't build.
$ guix build -L channel -f .guix.scm
starts out saying
The following derivations will be built:
  /gnu/store/b62d2v6210p8j27fcx7z08xb3lcjw5hi-gwl-0.3.0.drv
  /gnu/store/prx66i4jvs445g82gkc5sv7p7hhf27ba-guile-lib-0.2.7.drv
building /gnu/store/prx66i4jvs445g82gkc5sv7p7hhf27ba-guile-lib-0.2.7.drv...
and many lines later fails with
/tmp/guix-build-guile-lib-0.2.7.drv-0/guile-lib-0.2.7/build-aux/missing: line 81: makeinfo: command not found
WARNING: 'makeinfo' is missing on your system.
Note that
$ guix build --check guile-lib
works just fine, so the failure must somehow involve the additional channel. Unfortunately I don't understand the guile-replacing magic in there well enough for any serious debugging. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 12:30 ` Konrad Hinsen 2021-03-26 12:54 ` Konrad Hinsen @ 2021-03-26 13:13 ` Ricardo Wurmus 2021-03-26 15:36 ` Konrad Hinsen 1 sibling, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-03-26 13:13 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: >> This works for me correctly: > Thanks for looking into this! For me, your change makes no difference. Nor should it, because in my setup the "data" directory already exists. I still get an error message about the already existing file. > Maybe it's time to switch to the development version of GWL! Hmm, I don’t see any commits since 0.3.0 that would affect the cache implementation. GWL computes cache hashes for all processes and the processes they depend on. In your case it’s trivial: there’s just one process. The process definition is hashed and looked up in the cache to see if there is any output for the given process hash. In my test case this file exists: /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv /tmp/gwl is the cache prefix, and the hash corresponds to the process. Since data/weekly-incidence.csv exists and that’s the only declared output, GWL decides not to compute the output again. At least that happens in my case. I wonder why it doesn’t work in your case. > However, what I had in mind with my question is the management of intermediate results in my workflow, especially in its development phase. If I change my workflow file, or a script that it calls, I'd want only the affected steps to be recomputed. That's not much of an issue for my current test case, but I have bigger dreams for the future ;-) Yes, that’s the way it’s supposed to work already. GWL computes the hashes of each chain of processes, which includes the generated process script, its inputs, and the hashes of all processes that lead up to this process. 
Any change in the chain will lead to a new hash and thus a cache miss, leading GWL to recompute. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
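The hash-chaining scheme described above can be sketched in a few lines (illustrative Python only; GWL's actual implementation is in Guile and hashes the generated process scripts):

```python
# Rough sketch of cache-key computation: a process's key covers its
# generated script and the keys of all processes it depends on, so any
# upstream change propagates to every downstream key.
import hashlib

def process_key(script, dependency_keys=()):
    h = hashlib.sha256()
    h.update(script.encode("utf-8"))
    for key in sorted(dependency_keys):  # order-independent combination
        h.update(key.encode("utf-8"))
    return h.hexdigest()

# A two-step chain: a download feeding a plotting step (invented scripts).
download_key = process_key('wget -O data.csv http://example.org/data.csv')
plot_key = process_key('gnuplot plot.gp', [download_key])
```

Rerunning an unchanged chain reproduces the same keys (a cache hit), while editing the download script changes the plot step's key as well, forcing recomputation downstream.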
* Re: Managing data files in workflows 2021-03-26 13:13 ` Ricardo Wurmus @ 2021-03-26 15:36 ` Konrad Hinsen 2021-04-01 13:27 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-03-26 15:36 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Ricardo Wurmus <rekado@elephly.net> writes: > In my test case this file exists: > /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv And that's the one that causes the error message for me when I run the workflow a second time (see below). But as I understand now, the mistake happens earlier, as this step shouldn't be executed at all. > At least that happens in my case. I wonder why it doesn’t work in your case. Is there anything I can do to debug this? > Yes, that’s the way it’s supposed to work already. GWL computes the hashes of each chain of processes, which includes the generated process script, its inputs, and the hashes of all processes that lead up to this process. Any change in the chain will lead to a new hash and thus a cache miss, leading GWL to recompute. Excellent, that's what I was hoping to happen, given the Guix background. Cheers, Konrad.
$ guix workflow run test.w
info: Loading workflow file `test.w'...
info: Computing workflow `influenza-incidence'...
run: Executing: /bin/sh -c /gnu/store/km977swwhqj2n1mg34fq6sv4a41iabkm-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 13:55:22-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incid [ <=> ] 83,50K --.-KB/s in 0,008s

2021-03-26 13:55:22 (10,5 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

Backtrace:
6 (primitive-load "/home/hinsen/.config/guix/current/bin/guix")
In guix/ui.scm:
2164:12 5 (run-guix-command _ . _)
In srfi/srfi-1.scm:
460:18 4 (fold #<procedure 7f33df87ea80 at gwl/workflows.scm:388:2 (ite…> …)
460:18 3 (fold #<procedure 7f33df87ea60 at gwl/workflows.scm:391:13 (pr…> …)
In gwl/workflows.scm:
392:21 2 (_ #<process download> ())
In srfi/srfi-1.scm:
634:9 1 (for-each #<procedure 7f33df87e460 at gwl/workflows.scm:589:28…> …)
In guix/ui.scm:
566:4 0 (_ system-error "symlink" _ _ _)
guix/ui.scm:566:4: In procedure symlink:
File exists: "/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/./data/weekly-incidence.csv"
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-03-26 15:36 ` Konrad Hinsen @ 2021-04-01 13:27 ` Ricardo Wurmus 2021-04-02 8:41 ` Konrad Hinsen 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-04-01 13:27 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Hi Konrad, > Ricardo Wurmus <rekado@elephly.net> writes: > >> In my test case this file exists: >> >> /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/data/weekly-incidence.csv > > And that's the one that causes the error message for me when I run the > workflow a second time (see below). But as I understand now, the mistake > happens earlier, as this step shouldn't be executed at all. > >> At least that happens in my case. I wonder why it doesn’t work in your >> case. > > Is there anything I can do to debug this? Maybe. You could run with “--dry-run” to see what GWL claims it would do to confirm that it considers the file to be “not cached”. Also enable more log events (in particular cache events) with “--log-events=error,info,execute,cache,debug” The backtrace makes it seem that caching the downloaded file fails. That’s surprising, because (@ (gwl cache) cache!) will delete an existing file in the cache before linking a file to the cache prefix. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-04-01 13:27 ` Ricardo Wurmus @ 2021-04-02 8:41 ` Konrad Hinsen 2021-04-07 11:38 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-04-02 8:41 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, > Maybe. You could run with “--dry-run” to see what GWL claims it would do to confirm that it considers the file to be “not cached”. > Also enable more log events (in particular cache events) with “--log-events=error,info,execute,cache,debug” Thanks, I think I made progress with those nice debugging aids. When I run my workflow for the first time, I see
cache: Caching `./data/weekly-incidence.csv' as `/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/./data/weekly-incidence.csv'
The '.' in there looks suspect. Let's see what I got:
$ ls -lR /tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra
/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra:
total 4
drwxrwxr-x 2 hinsen hinsen 4096 2 avril 10:13 data

/tmp/gwl/mwmeuuhnu7sv4mpouj7o5x4se4qp5n5auzhpkb7y7oxidoxzc6ra/data:
total 0
lrwxrwxrwx 1 hinsen hinsen 27 2 avril 10:13 weekly-incidence.csv -> ./data/weekly-incidence.csv
That's an invalid symbolic link, so it's not surprising that a second run doesn't find the cached file. When I use an absolute filename to refer to my download target, the symlink in the cache is valid and points to the downloaded file. And when I run the workflow a second time, it skips the "download" process as expected. But then, it fails trying to "restore" the file:
run: Skipping process "download" (cached at /tmp/gwl/ubvscxwoezl63qmvyfszlf6azmuc655h7gbbtosqshlm5r6ckyhq/).
cache: Restoring `/tmp/gwl/ubvscxwoezl63qmvyfszlf6azmuc655h7gbbtosqshlm5r6ckyhq//home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv' to `/home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv'
Backtrace:
6 (primitive-load "/home/hinsen/.config/guix/current/bin/guix")
In guix/ui.scm:
2164:12 5 (run-guix-command _ . _)
In srfi/srfi-1.scm:
460:18 4 (fold #<procedure 7f45ba1d5c40 at gwl/workflows.scm:388:2 (ite…> …)
460:18 3 (fold #<procedure 7f45ba1d5c20 at gwl/workflows.scm:391:13 (pr…> …)
In gwl/workflows.scm:
392:21 2 (_ #<process download> ())
In srfi/srfi-1.scm:
634:9 1 (for-each #<procedure 7f45ba1d57e0 at gwl/workflows.scm:549:26…> …)
In guix/ui.scm:
566:4 0 (_ system-error "symlink" _ _ _)
guix/ui.scm:566:4: In procedure symlink:
Operation not permitted: "/home/hinsen/projects/mooc-workflows/influenza-analysis/data/weekly-incidence.csv"
Looking at the source code in (gwl cache), restoring means symlinking the target file to the cached file, which can't work given that the cache is already a symlink to the target file. So... I don't understand how the cache is supposed to work. If it stores symlinks, there is no need to restore anything. If it is supposed to store copies, then that's not what it does. My original issue with the relative filename is a detail that should be easy to fix. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
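The dangling link is easy to reproduce outside GWL: a symlink with a relative target is resolved relative to the directory containing the link, not the directory that was current when the link was created. A Python sketch (the directory names are stand-ins for the working directory and /tmp/gwl/<hash>):

```python
# Reproduction of the dangling-link problem: storing a *relative* symlink
# target under the cache prefix makes the link resolve inside the cache
# directory, where the file does not exist.
import os
import tempfile

workdir = tempfile.mkdtemp()   # where the workflow actually ran
cachedir = tempfile.mkdtemp()  # stands in for /tmp/gwl/<hash>

os.makedirs(os.path.join(workdir, "data"))
target = os.path.join(workdir, "data", "weekly-incidence.csv")
with open(target, "w") as f:
    f.write("week,incidence\n")

os.makedirs(os.path.join(cachedir, "data"))
link = os.path.join(cachedir, "data", "weekly-incidence.csv")
# The relative target, exactly as shown by `ls -lR` above:
os.symlink("./data/weekly-incidence.csv", link)
```

Linking against the absolute path instead produces a valid symlink, matching the behavior observed when the download target is given as an absolute filename.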
* Re: Managing data files in workflows 2021-04-02 8:41 ` Konrad Hinsen @ 2021-04-07 11:38 ` Ricardo Wurmus 2021-04-08 7:28 ` Konrad Hinsen 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-04-07 11:38 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Looking at the source code in (gwl cache), restoring means symlinking the target file to the cached file, which can't work given that the cache is already a symlink to the target file. > So... I don't understand how the cache is supposed to work. If it stores symlinks, there is no need to restore anything. If it is supposed to store copies, then that's not what it does. Right, that’s really the heart of the problem here. Originally, I used hardlinks exclusively, but they don’t work everywhere. So I added symlinks, but obviously they have different semantics. We can fix the problem with symlinks by restoring the target of the link instead of the link itself, but I feel that we need to take a step back and consider what this cache is really to be used for. The cache assumes again that files are immutable when really they are not guaranteed to be immutable. Neither symlinks nor hardlinks give us any guarantees. I really would like to have independent copies of input and output files, but I also don’t want to needlessly copy files around or use up more space than absolutely necessary. We could punt on the problem of optimal space consumption and simply copy files to the cache. What do you think? -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
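The copy-based cache proposed here could look roughly like this (an illustrative Python sketch with invented function names, not GWL's API):

```python
# Sketch of a copy-based cache: store an independent copy of each declared
# output, restore by copying back.  Space-hungry, but the semantics are
# unambiguous: cache and working copy can diverge without corrupting each
# other, unlike a symlink or hardlink.
import os
import shutil

def cache_store(workdir, cachedir, relpath):
    """Copy an output file from the working directory into the cache."""
    dst = os.path.join(cachedir, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(workdir, relpath), dst)

def cache_restore(workdir, cachedir, relpath):
    """Copy a cached file back into the working directory."""
    dst = os.path.join(workdir, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(cachedir, relpath), dst)
```

The design trade-off is exactly the one raised in the message: copying makes restore semantics trivial at the cost of duplicated storage, while links save space at the cost of aliasing.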
* Re: Managing data files in workflows 2021-04-07 11:38 ` Ricardo Wurmus @ 2021-04-08 7:28 ` Konrad Hinsen 2021-05-03 9:18 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: Konrad Hinsen @ 2021-04-08 7:28 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: gwl-devel Hi Ricardo, > We can fix the problem with symlinks by restoring the target of the link > instead of the link itself, but I feel that we need to take a step back > and consider what this cache is really to be used for. Indeed, and I have to admit that this isn't clear to me yet. What is it supposed to protect against? Modification of files by other processes of the workflow? Modification of files outside of the workflow? Both? For the second situation (modification outside of the workflow), I think it would be sufficient to store a checksum, and terminate the workflow with an error if it detects such tampering. The first situation is more difficult. There are actually two cases: 1. The workflow intentionally updates files as it proceeds. 2. The workflow modifies a file by mistake. Only the workflow author can make the distinction, so this needs some specific input syntax. Case 2 could then again be handled by a simple checksum test for signalling an error. This leaves case 1, for which the only good solution is to make a copy of the file at the end of each process, and restore it in later runs. Cheers, Konrad. ^ permalink raw reply [flat|nested] 15+ messages in thread
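The checksum test suggested above for detecting modification outside the workflow can be sketched as follows (illustrative Python only; the state-file layout is invented):

```python
# Sketch: record each file's hash in a state file on first sight, and
# fail loudly when a later run sees different content.
import hashlib
import json
import os

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_unmodified(path, statefile):
    """Record `path`'s hash on first sight; raise if it changed since."""
    state = {}
    if os.path.exists(statefile):
        with open(statefile) as f:
            state = json.load(f)
    digest = sha256_of(path)
    if state.get(path) not in (None, digest):
        raise RuntimeError(f"{path} was modified outside the workflow")
    state[path] = digest
    with open(statefile, "w") as f:
        json.dump(state, f)
```

For case 1 (intentional updates), this check would simply be skipped for files the author declares as updated by the workflow, which is where a copy-and-restore mechanism takes over.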
* Re: Managing data files in workflows 2021-04-08 7:28 ` Konrad Hinsen @ 2021-05-03 9:18 ` Ricardo Wurmus 2021-05-03 11:58 ` zimoun 0 siblings, 1 reply; 15+ messages in thread From: Ricardo Wurmus @ 2021-05-03 9:18 UTC (permalink / raw) To: Konrad Hinsen; +Cc: gwl-devel Konrad Hinsen <konrad.hinsen@fastmail.net> writes: > Hi Ricardo, >> We can fix the problem with symlinks by restoring the target of the link instead of the link itself, but I feel that we need to take a step back and consider what this cache is really to be used for. > Indeed, and I have to admit that this isn't clear to me yet. What is it supposed to protect against? Modification of files by other processes of the workflow? Modification of files outside of the workflow? Both? > For the second situation (modification outside of the workflow), I think it would be sufficient to store a checksum, and terminate the workflow with an error if it detects such tampering. > The first situation is more difficult. There are actually two cases: 1. The workflow intentionally updates files as it proceeds. 2. The workflow modifies a file by mistake. > Only the workflow author can make the distinction, so this needs some specific input syntax. Case 2 could then again be handled by a simple checksum test for signalling an error. > This leaves case 1, for which the only good solution is to make a copy of the file at the end of each process, and restore it in later runs. Yes, you are right. On wip-drmaa I changed the cache to never symlink. It either hardlinks or copies. This solves the immediate problem. Yes, the semantics of hardlink/copy differ, but since our assumption is that intermediate files are reproducible, we can ignore this at this point. I want to make the cache store/restore actions configurable, though, so that you can implement whatever caching method you want (including caching by copying to AWS S3). 
I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-05-03 9:18 ` Ricardo Wurmus @ 2021-05-03 11:58 ` zimoun 2021-05-03 13:47 ` Ricardo Wurmus 0 siblings, 1 reply; 15+ messages in thread From: zimoun @ 2021-05-03 11:58 UTC (permalink / raw) To: Ricardo Wurmus, Konrad Hinsen; +Cc: gwl-devel Hi, On Mon, 03 May 2021 at 11:18, Ricardo Wurmus <rekado@elephly.net> wrote: > I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. Ah, maybe it is the same idea that we discussed some days ago on #guix-hpc. To me, everything should be “immutable” and stored “somewhere”. Somehow, you do not need to hash all the contents but only the input hashes and the ’process’ itself. And it is already done somehow for packages. The only expensive hash is the fixed-output one, i.e., the initial data input. Well, I have not sent my explanations because my picture of how GWL currently works is still a bit vague. And I have not done my homework yet. Needless to say, I need to re-read the previous discussions we had on the topic. :-) However, I do not see what “mutable” could be. Cheers, simon ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Managing data files in workflows 2021-05-03 11:58 ` zimoun @ 2021-05-03 13:47 ` Ricardo Wurmus 0 siblings, 0 replies; 15+ messages in thread From: Ricardo Wurmus @ 2021-05-03 13:47 UTC (permalink / raw) To: zimoun; +Cc: gwl-devel zimoun <zimon.toutoune@gmail.com> writes: > Hi, > On Mon, 03 May 2021 at 11:18, Ricardo Wurmus <rekado@elephly.net> wrote: >> I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory. > Ah, maybe it is the same idea that we discussed some days ago on #guix-hpc. > To me, everything should be “immutable” and stored “somewhere”. Somehow, you do not need to hash all the contents but only the input hashes and the ’process’ itself. This is in fact what we do now. Every process code snippet is hashed, and we compute the hashes of the chain of processes to determine whether to re-run something or not. > However, I do not see what “mutable” could be. It’s a very generous assumption that intermediate output files are reproducible. Often that’s not actually the case. Having a “mutable” modifier would allow us to mark an output file as not reproducible and thus always in need of recomputation. -- Ricardo ^ permalink raw reply [flat|nested] 15+ messages in thread