From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Hinsen <konrad.hinsen@fastmail.net>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 9 Feb 2018 20:15:28 +0100
Message-ID: <1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>
References: <365e13248634ac1e26cf6678611d550d@hypermove.net>
	<87mv0ixf07.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Return-path: <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:4830:134:3::10]:49346)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <konrad.hinsen@fastmail.net>) id 1ekE93-0004VI-Tb
	for guix-devel@gnu.org; Fri, 09 Feb 2018 14:15:38 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <konrad.hinsen@fastmail.net>) id 1ekE8y-0005H6-VA
	for guix-devel@gnu.org; Fri, 09 Feb 2018 14:15:37 -0500
Received: from out2-smtp.messagingengine.com ([66.111.4.26]:45847)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <konrad.hinsen@fastmail.net>)
	id 1ekE8y-0005DQ-Me
	for guix-devel@gnu.org; Fri, 09 Feb 2018 14:15:32 -0500
In-Reply-To: <87mv0ixf07.fsf@gnu.org>
Content-Language: en-US
List-Id: "Development of GNU Guix and the GNU System distribution."
	<guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guix-devel/>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=subscribe>
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
To: guix-devel@gnu.org

Hi,

On 09/02/2018 18:13, Ludovic Courtès wrote:

> Amirouche Boubekki <amirouche@hypermove.net> skribis:
> 
>> tl;dr: Distribution of data and software seems similar.
>>         Data is more and more important in software and reproducible
>>         science. Data science ecosystem lakes resources sharing.
>>         I think guix can help.
> 
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that.  It could quickly become a bottleneck.  That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.

I'd say it depends on the data and how it is used inside and outside of 
a workflow. Some data could very well be stored in the store, and then 
distributed via standard channels (Zenodo, ...) after export by "guix 
pack". For big datasets, some other mechanism is required.

I think it's worth thinking carefully about how to exploit guix for 
reproducible computations. As Lispers know very well, code is data and 
data is code. Building a package is a computation like any other. 
Scientific workflows could be handled by a specific build system. In 
fact, as long as no big datasets or multiple processors are involved, we 
can do this right now, using standard package declarations.
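
To illustrate the idea that a workflow step is a computation handled like a 
build, here is a minimal sketch in Python. It is not Guix code. A result is 
kept under a key derived from a hash of the step's name and inputs, so 
rerunning a step with identical inputs reuses the stored result. The names 
`STORE`, `store_key` and `run_step` are hypothetical, and the in-memory 
dictionary only stands in for a real store.

```python
import hashlib
import json

STORE = {}  # hypothetical in-memory stand-in for a content-addressed store


def store_key(name, inputs):
    """Derive a key from the step name and its inputs.

    Assumption: inputs are JSON-serializable, so they can be hashed
    in a canonical (sorted-keys) form.
    """
    payload = json.dumps({"name": name, "inputs": inputs}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return digest[:32] + "-" + name


def run_step(name, func, inputs):
    """Run func(**inputs) unless a result for the same key already exists."""
    key = store_key(name, inputs)
    if key not in STORE:
        STORE[key] = func(**inputs)
    return key, STORE[key]


def mean(values):
    return sum(values) / len(values)


# The second call finds the stored result instead of recomputing it.
key1, result1 = run_step("mean", mean, {"values": [1.0, 2.0, 3.0]})
key2, result2 = run_step("mean", mean, {"values": [1.0, 2.0, 3.0]})
```

Changing any input changes the key, so an old result is never silently 
reused for new data.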

It would be nice if big datasets could conceptually be handled in the 
same way while being stored elsewhere - a bit like git-annex does for 
git. And for parallel computing, we could have special build daemons.
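
As a rough sketch of the git-annex-like approach, the following Python 
fragment keeps only a small pointer record (hash, size, location) in place 
of a big dataset, and checks fetched bytes against it. The `DataPointer` 
class, both helper functions and the location string are hypothetical. 
This is neither how git-annex nor how Guix implements anything.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPointer:
    """Small record kept in place of a large dataset (hypothetical)."""
    sha256: str
    size: int
    location: str  # where the bytes actually live; placeholder below


def make_pointer(data: bytes, location: str) -> DataPointer:
    """Describe a dataset by its hash and size without storing its bytes."""
    return DataPointer(hashlib.sha256(data).hexdigest(), len(data), location)


def verify(pointer: DataPointer, data: bytes) -> bool:
    """Check that fetched bytes match what the pointer promises."""
    return (len(data) == pointer.size
            and hashlib.sha256(data).hexdigest() == pointer.sha256)


dataset = b"x,y\n1,2\n3,4\n"
pointer = make_pointer(dataset, "remote:example-dataset")
```

The pointer is small enough to be handled like any other input, while the 
hash still ties the computation to one exact version of the dataset.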

Konrad.