From: Konrad Hinsen
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 9 Feb 2018 20:15:28 +0100
Message-ID: <1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>
In-Reply-To: <87mv0ixf07.fsf@gnu.org>
To: guix-devel@gnu.org

Hi,

On 09/02/2018 18:13, Ludovic Courtès wrote:
> Amirouche Boubekki skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>> Data is more and more important in software and reproducible
>> science. Data science ecosystem lakes resources sharing.
>> I think guix can help.
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck. That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.

I'd say it depends on the data and how it is used inside and outside of a workflow. Some data could very well be stored in the store, and then distributed via standard channels (Zenodo, ...) after export by "guix pack". For big datasets, some other mechanism is required.

I think it's worth thinking carefully about how to exploit guix for reproducible computations. As Lispers know very well, code is data and data is code. Building a package is a computation like any other. Scientific workflows could be handled by a specific build system. In fact, as long as no big datasets or multiple processors are involved, we can do this right now, using standard package declarations.

It would be nice if big datasets could conceptually be handled in the same way while being stored elsewhere - a bit like git-annex does for git. And for parallel computing, we could have special build daemons.

Konrad.
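
To make the git-annex comparison above a bit more concrete, here is a minimal Python sketch of the general idea: keep only a small, hash-bearing pointer under version or store control, and keep the bulky data elsewhere, verifying it against the pointer when it is fetched. This is not how Guix or git-annex actually implement anything; the function names (`make_pointer`, `verify`, `demo`), the pointer layout, and the example.org URL are all hypothetical and chosen purely for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(dataset, location):
    """Build the small record that would be tracked instead of the data.

    The layout (sha256, size, location) is an assumption for this sketch.
    """
    return {
        "sha256": sha256_of(dataset),
        "size": dataset.stat().st_size,
        "location": location,  # where the real bytes live (hypothetical URL)
    }


def verify(dataset, pointer):
    """Check that a fetched copy matches what the pointer promises."""
    return (
        dataset.stat().st_size == pointer["size"]
        and sha256_of(dataset) == pointer["sha256"]
    )


def demo():
    """Return (intact, tampered) verification results for a toy dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "measurements.csv"
        data.write_text("t,x\n0,1.0\n1,2.5\n")
        pointer = make_pointer(data, "https://example.org/measurements.csv")
        intact = verify(data, pointer)
        # Same length, different content: only the hash can catch this.
        data.write_text("t,x\n0,1.0\n1,9.9\n")
        tampered = verify(data, pointer)
    return intact, tampered
```

The pointer stays tiny regardless of dataset size, so it could plausibly travel with a package declaration or a repository, while the data itself is fetched from wherever it is hosted and checked on arrival.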