From: Amirouche Boubekki
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 16 Feb 2018 12:41:18 +0000
To: zimoun
Cc: Guix Devel

On Thu, Feb 15, 2018 at 6:11 PM zimoun <zimon.toutoune@gmail.com> wrote:

> Hi,
>
> Thank you for this food for thought.
>
> I agree that the frontier between code and data is arbitrary.
>
> However, I am not sure I get the picture about data management in the
> context of Reproducible Science. What is the issue?
>
> So, I take up your invitation to explore your idea. :-)
>
> [...]
>
> For me, just talking about code, it is not a straightforward task to
> define what the properties of a reproducible and fully controlled
> computational environment are. It is --I guess-- what Guix is defining
> (transactional, user profiles, hackable, etc.). It appears to me even
> more difficult for data.
>
> What are such properties for data management? In other words, on
> paper, what are the benefits of managing a piece of data in the store?
> For example, the weights of a trained neural network, or the positions
> of the atoms in a protein structure.

Given versioned datasets, you could want to switch the input dataset of a
given "pipeline" to see how different data produce different results.

Also, it is desirable to be able to re-start a "pipeline" when a dataset
is updated.

> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it if
> needed, does some checks, and then puts it in /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/
> contains something hash-based in the folder name.
>
> Is it not enough?

It is not enough: you could need to make a diff between two datasets,
which is not easily done if the data is stored in tarballs. But that is
not a case that can be handled by Guix.

> Why do you need the history of changes? As git provides?

Because, if the dataset introduces a change that is not handled by the
rest of the code, you can find out by looking at the diff. For instance,
a column that is an enumeration of three values now has a fourth. But
again, it's not a case that's meant to be handled by Guix.
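
To make that enumeration example concrete, here is a minimal sketch (plain
Python, nothing Guix-specific; the file names and the "status" column are
made up for the example) that reports the values present in a new version
of a CSV dataset but absent from the old one:

    #!/usr/bin/env python3
    """Report enumeration values that appear in the new version of a
    dataset but not in the old one (file and column names are examples)."""

    import csv
    import sys

    def column_values(path, column):
        """Return the set of distinct values in `column` of a CSV file."""
        with open(path, newline='') as f:
            return {row[column] for row in csv.DictReader(f)}

    if __name__ == '__main__':
        # e.g.: python3 check-enum.py dataset-v1.csv dataset-v2.csv status
        old, new, column = sys.argv[1:4]
        added = column_values(new, column) - column_values(old, column)
        if added:
            print("new values in column '%s': %s"
                  % (column, ", ".join(sorted(added))))
        else:
            print("no new values in column '%s'" % column)

Of course this assumes both versions are already unpacked somewhere: with
plain tarballs you would first have to extract them, which is exactly the
annoyance I mean.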
Like others have said, there are different kinds of data and, even if it
were possible to handle large datasets in the Guix store, it would also
require a lot of space and computation power.

ConceptNet 5.5.5 is 10G, takes more than a dozen hours to build, and
AFAIK is not reproducible since it takes its input directly from a live
instance of Wiktionary. WikiData is 100G but requires no processing
power. Those are structured data that you could want to version in
something like git. Things like spaCy models, which are around 1G and
take, I guess, a few hours to build, are not structured. Those are the
large datasets I know about, and there are very few of them compared to
small datasets (see data.gouv.fr).

I think there are various opportunities around reproducible data
science. In particular, I see two main ones:

a) Packaging data and working with upstream to keep the data clean. This
work is already handled by private actors and by initiatives like
http://datahub.io/

b) Cooperating around the making of datasets, some kind of *git for
data*, for which there are few or no initiatives. FWIW, I started such a
project, which I call neon.

Thanks for the feedback.
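
PS: to give a rough idea of what I mean by "git for data", here is a toy
sketch of content-addressed records (this is just the general idea, not
neon's actual code): each record is hashed, a dataset version is the set
of its record hashes, and diffing two versions becomes a set difference
instead of untarring and comparing files.

    import hashlib
    import json

    def record_hash(record):
        """Content-address a record by hashing its canonical JSON form."""
        canonical = json.dumps(record, sort_keys=True, separators=(',', ':'))
        return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

    def snapshot(records):
        """A dataset version is simply the set of its record hashes."""
        return {record_hash(r) for r in records}

    def diff(old_records, new_records):
        """Return (added, removed) record hashes between two versions."""
        old, new = snapshot(old_records), snapshot(new_records)
        return new - old, old - new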

