unofficial mirror of gwl-devel@gnu.org
 help / color / mirror / Atom feed
From: Ricardo Wurmus <rekado@elephly.net>
To: Konrad Hinsen <konrad.hinsen@fastmail.net>
Cc: gwl-devel@gnu.org
Subject: Re: Managing data files in workflows
Date: Fri, 26 Mar 2021 09:47:11 +0100	[thread overview]
Message-ID: <87r1k2ti7k.fsf@elephly.net> (raw)
In-Reply-To: <m18s6bk12w.fsf@ordinateur-de-catherine--konrad.home>


Hi Konrad,

> Coming from make-like workflow systems, I wonder how data files are best
> managed in GWL workflow. GWL is clearly less file-centric than make
> (which is a Good Thing in my opinion), but at a first reading of the
> manual, it doesn't seem to care about files at all, except for
> auto-connect.
>
> A simple example:
>
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
>   # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
>
> workflow influenza-incidence
>   processes download
> ==================================================

This works for me correctly:

--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
   /gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
   /gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv

building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[  2/  3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c /gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))' 
--2021-03-26 09:41:17--  http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incidence.csv                        [ <=>                                                                                                 ]  83.50K  --.-KB/s    in 0.05s   

2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$ --8<---------------cut here---------------end--------------->8---

Here’s the changed workflow:

--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---

It skips the process because the output file exists and the daring
assumption we make is that outputs are reproducible.

I would like to make these assumptions explicit in a future version, but
I’m not sure how.  An idea is to add keyword arguments to “file” that
allows us to provide a content hash, or merely a flag to declare a file
as volatile and thus in need of recomputation.

I also wanted to have IPFS and git-annex support, but before I embark on
this I want to understand exactly how this should behave and what the UI
should be.  E.g. having an input that is declared as “IPFS-file” would
cause that input file to be fetched automatically without having to
specify a process that downloads it first.  (Something similar could be
implemented for web resources as in your example.)

-- 
Ricardo


  parent reply	other threads:[~2021-03-26  8:47 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-25  9:57 Managing data files in workflows Konrad Hinsen
2021-03-26  7:02 ` zimoun
2021-03-26 12:46   ` Konrad Hinsen
2021-03-26  8:47 ` Ricardo Wurmus [this message]
2021-03-26 12:30   ` Konrad Hinsen
2021-03-26 12:54     ` Konrad Hinsen
2021-03-26 13:13     ` Ricardo Wurmus
2021-03-26 15:36       ` Konrad Hinsen
2021-04-01 13:27         ` Ricardo Wurmus
2021-04-02  8:41           ` Konrad Hinsen
2021-04-07 11:38             ` Ricardo Wurmus
2021-04-08  7:28               ` Konrad Hinsen
2021-05-03  9:18                 ` Ricardo Wurmus
2021-05-03 11:58                   ` zimoun
2021-05-03 13:47                     ` Ricardo Wurmus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.guixwl.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r1k2ti7k.fsf@elephly.net \
    --to=rekado@elephly.net \
    --cc=gwl-devel@gnu.org \
    --cc=konrad.hinsen@fastmail.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).