From: zimoun
To: Konrad Hinsen, gwl-devel@gnu.org
Subject: Re: Managing data files in workflows
Date: Fri, 26 Mar 2021 08:02:40 +0100
Message-ID: <86v99ebdnz.fsf@gmail.com>

Hi Konrad,

This does not answer your concrete question but instead opens a new
one. :-)

Well, I never finished this draft; maybe it is worth discussing:

 1. how to deal with data?
 2. on what does the workflow trigger a recomputation?

Cheers,
simon

-------------------- Start of forwarded message --------------------

Hi,

The recent features of the Guix Workflow Language [1] are really neat!
The end-to-end paper by Ludo [2] is also really cool!  For the online
Guix Day back in December, it would have been cool to be able to
distribute the videos via a channel.  Or it could be cool to have all
the talk materials [3] in a channel.

But a package is not the right abstraction here.  First, because a
“data” can have multiple sources; second, data can be really large; and
third, data are not always readable as source and do not have an
output; data are kind of fixed output.  (Code is data but data is not
code. :-))

Note that data is already fetched via packages, see
’r-bsgenome-hsapiens-ucsc-hg19’ or ’r-bsgenome-hsapiens-ucsc-hg38’
(’guix import’ reports ~677.3MiB and ’guix size’ reports ~748.0 MiB).
I am not speaking about these.

If I may, let us take the example of Lars’s talk from Guix Day: there
are 2 parts, the video itself and the slides.  Both are part of the
same thing.  Another example is Konrad’s paper, with the paper and the
supplementary material (code+data).

With these 2 examples, ’package’ with some tweaks could be used.  But
for the data I deal with at work, the /gnu/store is not designed for
that.

To fix ideas, consider a (large) genomics study: let us say 100
patients and 0.5-10GB of data for each, in addition to the genomics
reference, which means a couple of GB.  At work, these days we do not
have too many new genomic projects; let us say there are 3 projects in
parallel.  I let you complete the calculation. ;-)

There are 3 levels:

 1- the methods for fetching: URL (http or ftp), Git, IPFS, Dat, etc.
 2- the record representing a “data”
 3- how to effectively locally store and deal with it

And whether it makes sense for a ’data’ to be an input of a ’package’,
and conversely, is an open question.

A long time ago, with the GWL folks we discussed a “backend”, such as
git-annex or something else, but from my understanding, it would
answer #3, and what git-annex accepts as protocols would answer #1.
That leaves #2.

In my project, I would like to have 3 files: manifest describing which
tools, channels describing at which version, and data describing how
to fetch the data.  Then, I have the tools to work reproducibly: I can
apply a workflow (GWL, my custom Python script, etc.).

1:
2:
3:

Cheers,
simon
-------------------- End of forwarded message --------------------
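
To make level #2 and the storage estimate from the forwarded message a
bit more concrete, here is a minimal Python sketch.  Everything in it
is an assumption for illustration only: the names DataSource,
DataRecord and storage_range_gb are hypothetical, the locations and the
hash are placeholders, and "a couple of GB" of reference data is taken
as 2 GB per project.  It is not an existing Guix or GWL interface.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataSource:
    """One way to fetch the data (level 1): a method plus a location."""
    method: str    # e.g. "url", "git", "ipfs", "dat"
    location: str  # placeholder value, not a real address


@dataclass(frozen=True)
class DataRecord:
    """Hypothetical record describing one piece of data (level 2)."""
    name: str
    sha256: str      # content hash: the data behaves like a fixed output
    size_gb: float   # expected size, known before fetching
    sources: Tuple[DataSource, ...] = ()  # same data, possibly several sources


def storage_range_gb(projects, patients, per_patient_gb, reference_gb):
    """Return (low, high) total storage in GB for parallel projects."""
    low, high = per_patient_gb
    return (projects * (patients * low + reference_gb),
            projects * (patients * high + reference_gb))


# A record with two alternative sources (placeholders only).
example = DataRecord(
    name="patient-001-reads",
    sha256="<hash goes here>",
    size_gb=0.5,
    sources=(
        DataSource("url", "https://example.org/patient-001"),
        DataSource("ipfs", "<content identifier goes here>"),
    ),
)

# Numbers from the message: 3 projects, 100 patients, 0.5-10 GB each;
# the 2 GB of reference data per project is an assumption.
low, high = storage_range_gb(3, 100, (0.5, 10.0), 2.0)
print(f"{low:.0f} GB to {high:.0f} GB")  # prints "156 GB to 3006 GB"
```

Under these assumptions the three projects need somewhere between
roughly 150 GB and 3 TB, which gives an idea of why the /gnu/store is
not a comfortable place for such data.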