From: Ricardo Wurmus <rekado@elephly.net>
To: Konrad Hinsen <konrad.hinsen@fastmail.net>
Cc: gwl-devel@gnu.org
Subject: Re: Managing data files in workflows
Date: Fri, 26 Mar 2021 09:47:11 +0100
Message-ID: <87r1k2ti7k.fsf@elephly.net>
In-Reply-To: <m18s6bk12w.fsf@ordinateur-de-catherine--konrad.home>
Hi Konrad,
> Coming from make-like workflow systems, I wonder how data files are best
> managed in GWL workflow. GWL is clearly less file-centric than make
> (which is a Good Thing in my opinion), but at a first reading of the
> manual, it doesn't seem to care about files at all, except for
> auto-connect.
>
> A simple example:
>
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
>   # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
>
> workflow influenza-incidence
>   processes download
> ==================================================
This works correctly for me:
--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
/gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
/gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv
building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[ 2/ 3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c /gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 09:41:17-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’
./data/weekly-incidence.csv [ <=> ] 83.50K --.-KB/s in 0.05s
2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$
--8<---------------cut here---------------end--------------->8---
Here’s the changed workflow:
--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---
On the second run it skips the process because the output file already
exists; the daring assumption we make is that outputs are reproducible.
I would like to make these assumptions explicit in a future version, but
I’m not sure how. One idea is to add keyword arguments to “file” that
allow us to provide a content hash, or merely a flag that declares a file
as volatile and thus in need of recomputation.
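
To make the idea concrete, here is a minimal sketch of such a decision rule, written in Python rather than GWL. The names `FileSpec`, `sha256`, `volatile`, and `needs_run` are hypothetical and do not correspond to any existing GWL feature; the sketch only shows how a declared hash or a volatile flag could override the current "file exists, so skip" behaviour.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FileSpec:
    """Hypothetical declaration of an output file (not GWL's actual API)."""
    path: str
    sha256: Optional[str] = None  # expected content hash, if one is declared
    volatile: bool = False        # True means: always recompute


def needs_run(spec: FileSpec) -> bool:
    """Decide whether the process producing `spec` has to run."""
    target = Path(spec.path)
    if spec.volatile or not target.exists():
        return True
    if spec.sha256 is None:
        # No hash declared: trust the existing file, i.e. assume that
        # outputs are reproducible.
        return False
    # A hash was declared: rerun only if the file content does not match it.
    actual = hashlib.sha256(target.read_bytes()).hexdigest()
    return actual != spec.sha256
```

With neither keyword given, an existing file is trusted as it is now. A declared hash turns that implicit trust into a check, and the volatile flag disables skipping altogether.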
I also wanted to have IPFS and git-annex support, but before I embark on
this I want to understand exactly how this should behave and what the UI
should be. E.g. having an input that is declared as “IPFS-file” would
cause that input file to be fetched automatically without having to
specify a process that downloads it first. (Something similar could be
implemented for web resources as in your example.)
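
The automatic fetching described above could behave roughly like the following Python sketch. It is an illustration under assumptions, not a proposed GWL interface: `ensure_input` and `fetch_url` are hypothetical names, and a plain HTTP download stands in for whatever an IPFS or git-annex backend would do.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Download `url` and return its content (stand-in for an IPFS fetch)."""
    with urlopen(url) as response:
        return response.read()


def ensure_input(path: str, url: str,
                 fetch: Callable[[str], bytes] = fetch_url) -> Path:
    """Return `path`, fetching it from `url` first if it does not exist yet."""
    target = Path(path)
    if not target.exists():
        # Create the parent directory, like `mkdir -p $(dirname ...)`.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(url))
    return target
```

Passing the fetcher as a parameter keeps the transport (HTTP, IPFS, git-annex) separate from the decision of whether to fetch at all, which is the part the workflow engine would own.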
--
Ricardo