Re: Use guix to distribute data & reproducible (data) science

From: zimoun <zimon.toutoune@gmail.com>
To: Konrad Hinsen <konrad.hinsen@fastmail.net>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Sat, 10 Feb 2018 00:01:56 +0100
Message-ID: <CAJ3okZ2rsr=NYa7U_Af33puNaMsXuCL2jSjh5UMnFUubGqEnaw@mail.gmail.com>
In-Reply-To: <1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>

Hi,

> I'd say it depends on the data and how it is used inside and outside of a
> workflow. Some data could very well be stored in the store, and then
> distributed via standard channels (Zenodo, ...) after export by "guix pack".
> For big datasets, some other mechanism is required.

I am not sure I understand the point.
From my point of view, there are two kinds of datasets:
 a- the ones that are part of the software, e.g., used to run the
tests. They are usually small, but not always;
 b- the ones the software is applied to, which are not in the source
repository. They may be big or not.

I do not know whether Guix has an established policy for case a-; I am
not sure one is even possible (e.g., should a whole-genome FASTA file
be included to test alignment tools?).

It does not seem to me a good idea to try to put datasets of case b-
in the store.
Is that not the job of data management tools, e.g., databases?

I do not know much about this, but one idea would be to write a
workflow: you fetch the data, clean them, and check by hashing that the
result is the expected one. Only the software used to do that is in
the store. The input and output data are not, but your workflow checks
that they are the expected ones.
However, it depends on what we call 'cleaning', because some
algorithms are not deterministic.
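
To make the workflow idea above concrete, here is a minimal sketch in
Python. It is not GWL code and does not use any Guix API: the names
`sha256_of`, `clean`, and `run_step` are hypothetical, the "fetch" step
is assumed to have already put a file on disk, and the cleaning step is
a placeholder chosen only because it is deterministic.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean(raw: Path, cleaned: Path) -> None:
    """Hypothetical cleaning step: trim whitespace and drop blank lines.

    Bytes are read and written explicitly so that the output hash does
    not depend on platform newline translation.
    """
    text = raw.read_bytes().decode("utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    cleaned.write_bytes(("\n".join(lines) + "\n").encode("utf-8"))


def run_step(raw: Path, cleaned: Path,
             raw_sha256: str, cleaned_sha256: str) -> None:
    """Check the input hash, clean, then check the output hash."""
    if sha256_of(raw) != raw_sha256:
        raise ValueError(f"unexpected input data: {raw}")
    clean(raw, cleaned)
    if sha256_of(cleaned) != cleaned_sha256:
        raise ValueError(f"unexpected cleaned data: {cleaned}")
```

Neither file lives in the store here; only the two expected hashes are
recorded next to the workflow. The output check is meaningful only when
the cleaning step is deterministic, which is exactly the caveat above.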

Hmm, I do not know whether GWL has a mechanism to check the hash of
the `data-inputs' field.

> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and data
> is code. Building a package is a computation like any other. Scientific
> workflows could be handled by a specific build system. In fact, as long as
> no big datasets or multiple processors are involved, we can do this right
> now, using standard package declarations.

This thread seems to me a complement to these points (and personally,
I learned some things about the design of GWL from it):
https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html

> It would be nice if big datasets could conceptually be handled in the same
> way while being stored elsewhere - a bit like git-annex does for git. And
> for parallel computing, we could have special build daemons.

Hmm, is the point to add data management à la git-annex to GWL?

Have a nice weekend!
simon
