From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from list by lists.gnu.org with archive (Exim 4.71)
	id 1gopzD-0001E5-2C
	for mharc-gwl-devel@gnu.org; Wed, 30 Jan 2019 08:33:03 -0500
Received: from eggs.gnu.org ([209.51.188.92]:40216)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1gopzA-0001C0-R0
	for gwl-devel@gnu.org; Wed, 30 Jan 2019 08:33:01 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from )
	id 1gopz9-0004Um-S3
	for gwl-devel@gnu.org; Wed, 30 Jan 2019 08:33:00 -0500
Received: from sender-of-o51.zoho.com ([135.84.80.216]:21121)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71)
	(envelope-from )
	id 1gopz9-0004T3-J1
	for gwl-devel@gnu.org; Wed, 30 Jan 2019 08:32:59 -0500
References: <87bm40qta0.fsf@elephly.net> <875zu7refm.fsf@elephly.net>
	<87womnptym.fsf@elephly.net> <874l9rpeiq.fsf@elephly.net>
	<87womnnjg0.fsf@elephly.net>
From: Ricardo Wurmus
Date: Wed, 30 Jan 2019 13:46:49 +0100
Message-ID: <87muninwhy.fsf@elephly.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: support for containers
To: zimoun
Cc: gwl-devel@gnu.org

Hi Simon,

> On Wed, 30 Jan 2019 at 00:16, Ricardo Wurmus wrote:
>
>> Since we don’t hash the data (because it’s expensive) the scripts are
>> “proxies” for the data files.  We compute the hashes over the dependent
>> scripts and assume that this is enough to decide whether to recompute
>> data files or to serve them from the cache/store.
>
> Just to be sure to well understand your point, let pick the simple
> example from genomics pipeline:
>    FASTQ -align-> BAM -variant-> VCF
> So, you intend to hash:
>  - the data FASTQ
>  - the scripts align and variant
> Or only the scripts containing reference to inputs (here FASTQ), where
> the reference is a location fixed by the user.

Currently, there is no good way for a user to pass inputs to a workflow,
so I haven’t yet thought about how to handle the user’s input files.
This still needs to be done.

Currently, the only way a user can provide files as inputs is by writing
a process that “generates” the file (even if it does so by merely
accessing the impure file system).  That’s rather inconvenient and it
wouldn’t work in a container where only declared files are available.

Users should be able to map files to any process input from the command
line (or through a configuration file).  For a provided input we should
take into account the hash of some file property: the timestamp and the
name (cheap), or the contents (expensive).

As regards hashing the scripts here’s what I have so far:

--8<---------------cut here---------------start------------->8---
(define (workflow->data-hashes workflow engine)
  "Return an alist associating each of the WORKFLOW's processes with
the hash of all the process scripts used to generate their outputs."
  (define make-script (process->script engine))
  (define graph (workflow-restrictions workflow))

  ;; Compute hashes for chains of scripts.
  (define (kons process acc)
    (let* ((script (make-script process #:workflow workflow))
           (hash   (bytevector->u8-list
                    (sha256
                     (call-with-input-file script get-bytevector-all)))))
      (cons
       (cons process
             (append hash
                     ;; Hashes of processes this one depends on.
                     (append-map (cut assoc-ref acc <>)
                                 (or (assoc-ref graph process) '()))))
       acc)))
  (map (match-lambda
         ((process . hashes)
          (cons process
                (bytevector->base32-string
                 (sha256
                  (u8-list->bytevector hashes))))))
       (fold kons '()
             (workflow-run-order workflow #:parallel? #f))))
--8<---------------cut here---------------end--------------->8---

I.e. for any process we want the hash over the script used for the
current process and for all processes that lead up to the current one.
This gives us a hash string for every process.
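
For readers who do not read Scheme, the same chaining idea can be
sketched in plain Python.  This is only an illustration, not GWL code:
the function name, the three arguments (script contents, a dependency
map, and a run order listing dependencies first), and the placeholder
script contents are all assumptions made for the example.  The base32
encoding comes from Python's standard library and is not guaranteed to
match the string that Guix's bytevector->base32-string produces.

```python
import base64
import hashlib


def chained_script_hashes(scripts, depends_on, run_order):
    """Return {process: hash string} covering each process's script and
    the scripts of every process upstream of it.

    All three parameters are hypothetical stand-ins for GWL internals:
    scripts    -- {process name: script contents as bytes}
    depends_on -- {process name: [names of processes it depends on]}
    run_order  -- process names, dependencies before dependents
    """
    chains = {}  # process -> own script digest + chains of its dependencies
    for process in run_order:
        chain = hashlib.sha256(scripts[process]).digest()
        for dep in depends_on.get(process, []):
            # Already computed, because run_order lists dependencies first.
            chain += chains[dep]
        chains[process] = chain
    return {
        process: base64.b32encode(hashlib.sha256(chain).digest())
        .decode("ascii").lower().rstrip("=")
        for process, chain in chains.items()
    }


# Placeholder script contents for the align -> variant example.
scripts = {"align": b"# align step\n", "variant": b"# variant step\n"}
depends_on = {"variant": ["align"]}
hashes = chained_script_hashes(scripts, depends_on, ["align", "variant"])
```

With this scheme, editing the “align” script changes the hash of both
“align” and “variant”, while editing only “variant” leaves the hash of
“align” untouched, so upstream results could still be reused.
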
We can then look up “${GWL_STORE}/${hash}/output-file-name”; if it
exists we use it.

The workflow runner would now also need to ensure that process outputs
are linked to the appropriate GWL_STORE location upon successful
execution.

-- 
Ricardo