From mboxrd@z Thu Jan  1 00:00:00 1970
From: zimoun <zimon.toutoune@gmail.com>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Thu, 15 Feb 2018 18:10:33 +0100
Message-ID: <CAJ3okZ14RAUU4FrBMEOpOLU9aV21BJbY1G5-WeZq3Q-TEmk1Hg@mail.gmail.com>
References: <365e13248634ac1e26cf6678611d550d@hypermove.net>
	<87mv0ixf07.fsf@gnu.org>
	<1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>
	<87lgfvu3dg.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Return-path: <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:4830:134:3::10]:44312)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <zimon.toutoune@gmail.com>) id 1emN3O-0008GR-7d
	for guix-devel@gnu.org; Thu, 15 Feb 2018 12:10:39 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <zimon.toutoune@gmail.com>) id 1emN3N-0004NY-2O
	for guix-devel@gnu.org; Thu, 15 Feb 2018 12:10:38 -0500
In-Reply-To: <87lgfvu3dg.fsf@gnu.org>
List-Id: "Development of GNU Guix and the GNU System distribution."
	<guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guix-devel/>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=subscribe>
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
To: =?UTF-8?Q?Ludovic_Court=C3=A8s?= <ludo@gnu.org>
Cc: Guix Devel <guix-devel@gnu.org>

Hi,

Thank you for this food for thought.


I agree that the frontier between code and data is arbitrary.

However, I am not sure I get the picture about data management in the
context of Reproducible Science. What is the issue?

So, I accept your invitation to explore your idea. :-)


Let us think about the old lab experiment. On one hand, you have your
protocol and the description of all the steps. On the other hand, you
have measurements and results. I am able to imagine some kind of
bit-to-bit mechanism for the protocol part. I am not sure about the
measurements part.

Well, the protocol is code or workflow; the measurements are data.
And I agree that, e.g., information about electronic orbits or the
weights of a trained neural network is sometimes part of the protocol. :-)

For me, just talking about code, it is not a straightforward task to
define what the properties of a reproducible and fully controlled
computational environment are. It is --I guess-- what Guix is defining
(transactional, user-profile, hackable, etc.). For data, it appears to
me even more difficult.

What are such properties for data management?

In other words, on paper, what are the benefits of managing some piece
of data in the store? For example, for applications such as the weights
of a trained neural network, or the positions of the atoms in a protein
structure.


For me --maybe I am wrong-- the way is to define a package (or
workflow) that fetches the data from some external source, cleans it if
needed, does some checks, and then puts it in /path/to/somewhere/
outside the store. In parallel computing, this /path/to/somewhere/ is
accessible by all the nodes. Moreover, the folder name of this
/path/to/somewhere/ contains something hash-based.
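
To make this concrete, here is a rough sketch in Python of what I have
in mind. It is only an illustration, not how Guix does or should do it:
the names `clean`, `check` and `materialize` are hypothetical, the fetch
step is replaced by a literal byte string, and using the SHA-256 of the
cleaned content for the folder name is my own assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def clean(raw: bytes) -> bytes:
    """Hypothetical cleaning step: normalize line endings, drop blank lines."""
    lines = [ln.strip() for ln in raw.replace(b"\r\n", b"\n").split(b"\n")]
    return b"\n".join(ln for ln in lines if ln) + b"\n"


def check(data: bytes) -> None:
    """Hypothetical sanity check: refuse data that is empty after cleaning."""
    if not data.strip():
        raise ValueError("dataset is empty after cleaning")


def materialize(raw: bytes, name: str, root: Path) -> Path:
    """Clean and check `raw`, then write it under a hash-named folder in `root`."""
    data = clean(raw)
    check(data)
    # Assumption: the folder name is derived from the cleaned content,
    # so identical data always lands in the same place.
    digest = hashlib.sha256(data).hexdigest()
    target = root / f"{digest[:32]}-{name}"
    target.mkdir(parents=True, exist_ok=True)
    (target / "data").write_bytes(data)
    return target


if __name__ == "__main__":
    # Stand-in for the fetch step; real code would download from the
    # external source instead.
    raw = b"x,y\r\n1,2\r\n\r\n3,4\r\n"
    with tempfile.TemporaryDirectory() as shared_root:
        path = materialize(raw, "measurements", Path(shared_root))
        print(path.name)
```

Here `root` plays the role of /path/to/somewhere/, e.g. a directory
shared by all the nodes. Two fetches that clean to the same bytes end up
in the same folder, and any change in the data gives a different one.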

Is that not enough?

Why do you need the history of changes, as git provides?


Secrets are a different story from the reproducible science toolchain, I
guess.


Thank you again.

All the best,
simon