From: Amirouche Boubekki
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 16 Feb 2018 12:41:18 +0000
To: zimoun
Cc: Guix Devel

On Thu, Feb 15, 2018 at 6:11 PM zimoun <zimon.toutoune@gmail.com> wrote:

> Hi,
>
> Thank you for this food for thought.
>
> I agree that the frontier between code and data is arbitrary.
>
> However, I am not sure I get the picture about data management in the
> context of Reproducible Science. What is the issue?
>
> So, I take up your invitation to explore your idea. :-)
>
> [...]
>
> For me, just talking about code, it is not a straightforward task to
> define what the properties of a reproducible and fully controlled
> computational environment are. It is --I guess-- what Guix is defining
> (transactional, user profiles, hackable, etc.). It appears to me even
> more difficult for data.
>
> What are such properties for data management? In other words, on
> paper, what are the benefits of managing a piece of data in the store?
> For example, the weights of a trained neural network, or the positions
> of the atoms in a protein structure.

Given versioned datasets, you could want to switch the input dataset of a
given "pipeline" to see how different data produce different results.

Also, it is desirable to be able to re-start a "pipeline" when a dataset
is updated.

> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it if
> needed, does some checks, and then puts it in /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/
> contains something hash-based in the folder name.
>
> Is it not enough?

It is not enough: you could need to make a diff between two datasets,
which is not easily done if the data is stored in tarballs. But that is
not a case that can be handled by Guix.

> Why do you need the history of changes? As git provides?

Because, if the dataset introduces a change that is not handled by the
rest of the code, you can find out by looking at the diff. For instance,
a column that is an enumeration of three values now has a fourth. But
again, it's not a case that's meant to be handled by Guix.
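
To make that enumeration example concrete, here is a minimal sketch (plain
Python, nothing Guix-specific; the file names and the "status" column are
made up for the example) that reports the values present in a new version
of a CSV dataset but absent from the old one:

    #!/usr/bin/env python3
    """Report enumeration values that appear in the new version of a
    dataset but not in the old one (file and column names are examples)."""

    import csv
    import sys

    def column_values(path, column):
        """Return the set of distinct values in `column` of a CSV file."""
        with open(path, newline='') as f:
            return {row[column] for row in csv.DictReader(f)}

    if __name__ == '__main__':
        # e.g.: python3 check-enum.py dataset-v1.csv dataset-v2.csv status
        old, new, column = sys.argv[1:4]
        added = column_values(new, column) - column_values(old, column)
        if added:
            print("new values in column '%s': %s"
                  % (column, ", ".join(sorted(added))))
        else:
            print("no new values in column '%s'" % column)

Of course this assumes both versions are already unpacked somewhere: with
plain tarballs you would first have to extract them, which is exactly the
annoyance I mean.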
Like others have said, there are different kinds of data and, even if it
were possible to handle large datasets in the Guix store, it would also
require a lot of space and computation power.

ConceptNet 5.5.5 is 10G, takes more than a dozen hours to build, and
AFAIK is not reproducible since it takes its input directly from a live
instance of Wiktionary. WikiData is 100G but requires no processing
power. Those are structured data that you could want to version in
something like git. Things like spaCy models, which are around 1G and
take, I guess, a few hours to build, are not structured. Those are the
large datasets I know about, and there are very few of them compared to
small datasets (see data.gouv.fr).

I think there are various opportunities around reproducible data
science. In particular, I see two main ones:

a) Packaging data and working with upstream to keep the data clean. This
work is already handled by private actors and by initiatives like
http://datahub.io/

b) Cooperating around the making of datasets, some kind of *git for
data*, for which there are few or no initiatives. FWIW, I started such a
project, which I call neon.

Thanks for the feedback.
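
PS: to give a rough idea of what I mean by "git for data", here is a toy
sketch of content-addressed records (this is just the general idea, not
neon's actual code): each record is hashed, a dataset version is the set
of its record hashes, and diffing two versions becomes a set difference
instead of untarring and comparing files.

    import hashlib
    import json

    def record_hash(record):
        """Content-address a record by hashing its canonical JSON form."""
        canonical = json.dumps(record, sort_keys=True, separators=(',', ':'))
        return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

    def snapshot(records):
        """A dataset version is simply the set of its record hashes."""
        return {record_hash(r) for r in records}

    def diff(old_records, new_records):
        """Return (added, removed) record hashes between two versions."""
        old, new = snapshot(old_records), snapshot(new_records)
        return new - old, old - new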

