unofficial mirror of gwl-devel@gnu.org
 help / color / mirror / Atom feed
From: Ricardo Wurmus <rekado@elephly.net>
To: zimoun <zimon.toutoune@gmail.com>
Cc: gwl-devel@gnu.org
Subject: Re: support for containers
Date: Wed, 30 Jan 2019 13:46:49 +0100	[thread overview]
Message-ID: <87muninwhy.fsf@elephly.net> (raw)
In-Reply-To: <CAJ3okZ0QGwo6dzdYwEpMBauYN2vYgFMuYMR2jAuk=nTGoLhkZg@mail.gmail.com>


Hi Simon,

> On Wed, 30 Jan 2019 at 00:16, Ricardo Wurmus <rekado@elephly.net> wrote:
>
>> Since we don’t hash the data (because it’s expensive) the scripts are
>> “proxies” for the data files.  We compute the hashes over the dependent
>> scripts and assume that this is enough to decide whether to recompute
>> data files or to serve them from the cache/store.
>
> Just to be sure to well understand your point, let pick the simple
> example from genomics pipeline:
>  FASTQ -align-> BAM -variant-> VCF
> So, you intend to hash:
>  - the data FASTQ
>  - the scripts align and variant
> Or only the scripts containing reference to inputs (here FASTQ), where
> the reference is a location fixed by the user.

Currently, there is no good way for a user to pass inputs to a workflow,
so I haven’t yet thought about how to handle the user’s input files.
This still needs to be done.  Currently, the only way a user can provide
files as inputs is by writing a process that “generates” the file (even
if it does so by merely accessing the impure file system).  That’s
rather inconvenient and it wouldn’t work in a container where only
declared files are available.

Users should be able to map files to any process input from the command
line (or through a configuration file).  For a provided input we should
take into account the hash of some file property: the timestamp and the
name (cheap), or the contents (expensive).

As regards hashing the scripts here’s what I have so far:

--8<---------------cut here---------------start------------->8---
(define (workflow->data-hashes workflow engine)
  "Return an alist associating each of the WORKFLOW's processes with
the hash of all the process scripts used to generate their outputs."
  (define make-script (process->script engine))
  (define graph (workflow-restrictions workflow))

  ;; Compute hashes for chains of scripts.
  (define (kons process acc)
    (let* ((script (make-script process #:workflow workflow))
           (hash   (bytevector->u8-list
                    (sha256 (call-with-input-file script get-bytevector-all)))))
      (cons
       (cons process
             (append hash
                     ;; Hashes of processes this one depends on.
                     (append-map (cut assoc-ref acc <>)
                                 (or (assoc-ref graph process) '()))))
       acc)))
  (map (match-lambda
         ((process . hashes)
          (cons process
                (bytevector->base32-string
                 (sha256
                  (u8-list->bytevector hashes))))))
       (fold kons '()
             (workflow-run-order workflow #:parallel? #f))))
--8<---------------cut here---------------end--------------->8---

I.e. for any process we want the hash over the script used for the
current process and for all processes that lead up to the current one.

This gives us a hash string for every process.  We can then look up
“${GWL_STORE}/${hash}/output-file-name” — if it exists we use it.  The
workflow runner would now also need to ensure that process outputs are
linked to the appropriate GWL_STORE location upon successful execution.

--
Ricardo

  reply	other threads:[~2019-01-30 13:33 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-28 23:03 support for containers Ricardo Wurmus
2019-01-29  9:38 ` Ricardo Wurmus
2019-01-29 10:39   ` zimoun
2019-01-29 11:46     ` Ricardo Wurmus
2019-01-29 14:29       ` zimoun
2019-01-29 17:19         ` Ricardo Wurmus
2019-01-29 21:52           ` zimoun
2019-01-29 23:16             ` Ricardo Wurmus
2019-01-30 10:17               ` zimoun
2019-01-30 12:46                 ` Ricardo Wurmus [this message]
2019-01-29 10:22 ` zimoun
2019-01-29 11:44   ` Ricardo Wurmus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.guixwl.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87muninwhy.fsf@elephly.net \
    --to=rekado@elephly.net \
    --cc=gwl-devel@gnu.org \
    --cc=zimon.toutoune@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).