From: Ricardo Wurmus <rekado@elephly.net>
To: zimoun <zimon.toutoune@gmail.com>
Cc: gwl-devel@gnu.org
Subject: Re: support for containers
Date: Wed, 30 Jan 2019 13:46:49 +0100
Message-ID: <87muninwhy.fsf@elephly.net>
In-Reply-To: <CAJ3okZ0QGwo6dzdYwEpMBauYN2vYgFMuYMR2jAuk=nTGoLhkZg@mail.gmail.com>

Hi Simon,

> On Wed, 30 Jan 2019 at 00:16, Ricardo Wurmus <rekado@elephly.net> wrote:
>
>> Since we don’t hash the data (because it’s expensive) the scripts are
>> “proxies” for the data files. We compute the hashes over the dependent
>> scripts and assume that this is enough to decide whether to recompute
>> data files or to serve them from the cache/store.
>
> Just to be sure I understand your point correctly, let’s pick a simple
> example from a genomics pipeline:
>   FASTQ -align-> BAM -variant-> VCF
> So, do you intend to hash:
>  - the data FASTQ
>  - the scripts align and variant
> Or only the scripts containing references to the inputs (here FASTQ),
> where the reference is a location fixed by the user?

There is currently no good way for a user to pass inputs to a workflow,
so I haven’t yet thought about how to handle the user’s input files;
this still needs to be done.  At the moment the only way to provide
files as inputs is to write a process that “generates” the file (even
if it does so by merely accessing the impure file system).  That’s
rather inconvenient, and it wouldn’t work in a container, where only
declared files are available.

Users should be able to map files to any process input from the command
line (or through a configuration file).  For a provided input we should
take into account the hash of some file property: either the timestamp
and the name (cheap) or the contents (expensive).
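
To make the cheap/expensive distinction concrete, here is a minimal
sketch in Guile.  The procedure names cheap-fingerprint and
content-fingerprint are hypothetical and not part of the GWL.  The
digest procedure is passed in as an argument (it could be sha256, for
instance), so the sketch relies only on core Guile.

```scheme
(use-modules (ice-9 binary-ports)   ; get-bytevector-all
             (rnrs bytevectors))    ; string->utf8, make-bytevector

;; Hypothetical helper: cheap fingerprint over FILE's base name and
;; modification time.  DIGEST is any procedure that maps a bytevector
;; to a digest.  The file contents are never read.
(define (cheap-fingerprint file digest)
  (let ((st (stat file)))
    (digest (string->utf8
             (string-append (basename file) ":"
                            (number->string (stat:mtime st)))))))

;; Hypothetical helper: expensive fingerprint over the whole contents
;; of FILE.  An empty file yields the EOF object, which is mapped to
;; an empty bytevector before hashing.
(define (content-fingerprint file digest)
  (let ((bv (call-with-input-file file get-bytevector-all #:binary #t)))
    (digest (if (eof-object? bv) (make-bytevector 0) bv))))
```

The cheap variant reports a change whenever the file is touched or
renamed, even if its contents stay the same; the expensive variant
only reacts to changed contents but has to read the whole file.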

As regards hashing the scripts, here’s what I have so far:

--8<---------------cut here---------------start------------->8---
(define (workflow->data-hashes workflow engine)
  "Return an alist associating each of the WORKFLOW's processes with
the hash of all the process scripts used to generate their outputs."
  (define make-script (process->script engine))
  (define graph (workflow-restrictions workflow))

  ;; Compute hashes for chains of scripts.
  (define (kons process acc)
    (let* ((script (make-script process #:workflow workflow))
           (hash   (bytevector->u8-list
                    (sha256 (call-with-input-file script get-bytevector-all)))))
      (cons
       (cons process
             (append hash
                     ;; Hashes of processes this one depends on.
                     (append-map (cut assoc-ref acc <>)
                                 (or (assoc-ref graph process) '()))))
       acc)))

  (map (match-lambda
         ((process . hashes)
          (cons process
                (bytevector->base32-string
                 (sha256
                  (u8-list->bytevector hashes))))))
       (fold kons '()
             (workflow-run-order workflow #:parallel? #f))))
--8<---------------cut here---------------end--------------->8---

That is, for any process we want the hash over the script used for the
current process and over the scripts of all processes that lead up to
it.  This gives us a hash string for every process.  We can then look up
“${GWL_STORE}/${hash}/output-file-name”; if it exists, we use it.  The
workflow runner would now also need to ensure that process outputs are
linked to the appropriate GWL_STORE location upon successful execution.
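
The lookup and the linking step could be sketched roughly as follows.
This is only an illustration: the names store-file-name, cached-output
and register-output! are hypothetical, the STORE directory is assumed
to exist already, and the use of a symbolic link (rather than a hard
link or a copy) is an assumption as well.

```scheme
;; Hypothetical helpers; STORE is the directory GWL_STORE points to.
(define (store-file-name store hash output-name)
  (string-append store "/" hash "/" output-name))

(define (cached-output store hash output-name)
  "Return the file name of the cached output if it exists, else #f."
  (let ((candidate (store-file-name store hash output-name)))
    (and (file-exists? candidate) candidate)))

(define (register-output! store hash output)
  "Link OUTPUT, a file produced by a successful process run, into
STORE under HASH and return the name of the link."
  (let ((directory (string-append store "/" hash))
        (target    (store-file-name store hash (basename output))))
    (unless (file-exists? directory)
      (mkdir directory))
    (unless (file-exists? target)
      ;; Use the absolute name so the link does not depend on the
      ;; current working directory.
      (symlink (canonicalize-path output) target))
    target))
```

Before running a process, the runner would call cached-output with the
process hash and skip the run when it returns a file name; after a
successful run it would call register-output! for each output.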
--
Ricardo