From: Ricardo Wurmus
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Mon, 19 Feb 2018 08:57:56 +0100
Message-ID: <87lgfpjtrv.fsf@elephly.net>
References: <24274adb01ba9c928a4701054b686a4a@hypermove.net>
In-reply-to: <24274adb01ba9c928a4701054b686a4a@hypermove.net>
To: Amirouche Boubekki
Cc: Guix Devel, ludovic.courtes@inria.fr

Amirouche Boubekki writes:

> Then, in a follow up mail, you reply to Konrad:
>
>>> Konrad Hinsen skribis:
>>
>> [...]
>>
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>>
>> Exactly. I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)

In addition to the answers by Ludo and Roel, I’d like to add that for
data we have more things that we’d like to know about.
For any given dataset on storage I’d like to know how it relates to
previous versions of the same dataset. The hash alone would not be
sufficient. I’d actually need to know which dataset is the parent and
which is a child. The store does not give me relations like that when
given two or more items. The store retains information about links
between items in one generation (if they embed such references), but not
across generations.

I think the requirements for the storage and retrieval of (big) datasets
are very different to those of software packages.

There are projects dedicated to dataset storage, such as Pachyderm.io.
Since data storage is just a stepping stone to better workflows,
Pachyderm also includes support for application bundles, but it may be
better to let a dedicated workflow language take care of the application
side. Maybe the GWL can be integrated with dedicated data storage
solutions like Pachyderm.

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net