From: Amirouche Boubekki
Subject: Use guix to distribute data & reproducible (data) science
Date: Fri, 09 Feb 2018 17:32:32 +0100
Message-ID: <365e13248634ac1e26cf6678611d550d@hypermove.net>
To: Guix Devel

Héllo all,

tl;dr: Distributing data and distributing software seem similar. Data is more and more important in software and in reproducible science. The data science ecosystem lacks resource sharing. I think guix can help.

Recently I stumbled upon the open data movement and its links with data science.

To give a high-level overview, there are several (web) platforms that allow administrations and companies to publish data and _distribute_ it. Examples of such platforms are data.gouv.fr [1] and various other platforms based on CKAN [2].
[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the repository is rather poor in terms of quality, which makes it very difficult to use.

The other side of open data and data-based software is that some software provides its own mechanism to _distribute_ data or binary blobs called 'models' that are sometimes based on libre data. Examples of such software are spacy [3], gensim [4], nltk [5] and word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it's common knowledge that data wrangling, a.k.a. cleaning and preparing data, is 80% of a data scientist's job. It's required because data distributors don't do it right: they have neither the manpower nor the knowledge.

To summarize:

1) Some software and platforms distribute _data_ themselves in some "closed garden" way. It's not the role of software to distribute data, especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope they do what they are supposed to do. How do you build the model? Is the model reproducible?

3) Preparing data must be redone all the time; let's share resources and do it once.

It seems to me that guix has all the required features to handle data and model distribution.

What do people think? Do we already use guix to distribute data and models?

Also, it seems good to surf on the AI frenzy ;)