From: ludo@gnu.org (Ludovic Courtès)
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Mon, 19 Feb 2018 00:42:01 +0100
Message-ID: <87r2phhnli.fsf@gnu.org>
In-Reply-To: <24274adb01ba9c928a4701054b686a4a@hypermove.net> (Amirouche Boubekki's message of "Fri, 16 Feb 2018 17:43:40 +0100")
To: Amirouche Boubekki
Cc: Guix Devel

Hi Amirouche,

Amirouche Boubekki skribis:

> On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
>> Hi!
>>
>> Amirouche Boubekki skribis:
>>
>>> tl;dr: Distribution of data and software seems similar.
>>> Data is more and more important in software and reproducible
>>> science. Data science ecosystem lacks resources sharing.
>>> I think guix can help.
>>
>> I think some of us especially Guix-HPC folks are convinced about the
>> usefulness of Guix as one of the tools in the reproducible science
>> toolchain (that was one of the themes of my FOSDEM talk). :-)
>>
>> Now, whether Guix is the right tool to distribute data, I don’t know.
>> Distributing large amounts of data is a job in itself, and the store
>> isn’t designed for that.  It could quickly become a bottleneck.
>
> What does it mean technically that the store “isn't designed for that”?

There are several potential issues.  One is GC: how convenient is it to
have big datasets subject to GC?  Another one is the I/O bottleneck: when
adding a file to the store, you currently do an ‘add-to-store’ RPC to
the daemon and pass it the file name; the daemon then reads the file
entirely to compute its content hash, which could be an issue with big
datasets.

HTH,
Ludo’.
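
As a rough illustration of the I/O cost described above (this is not the daemon's actual code): a minimal Python sketch, assuming a single regular file and SHA-256 as the hash function.  The names `content_hash` and `chunk_size` are hypothetical.  The point is only that every byte has to be read once before the hash is known, so the cost grows linearly with the size of the dataset.

```python
import hashlib


def content_hash(path, chunk_size=1 << 20):
    """Return (hex digest, bytes read) for the file at `path`.

    Hypothetical helper: it streams the file in fixed-size chunks, so
    memory use stays constant, but the whole file is still read once.
    """
    digest = hashlib.sha256()
    bytes_read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            digest.update(chunk)
            bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read
```

Calling `content_hash("dataset.bin")` on a multi-gigabyte file reads all of those gigabytes before returning, which is the kind of bottleneck mentioned above.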