From: Amirouche Boubekki
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Sat, 10 Feb 2018 09:51:41 +0000
To: Konrad Hinsen
Cc: guix-devel@gnu.org
List-Id: "Development of GNU Guix and the GNU System distribution."

On Fri, Feb 9, 2018 at 8:16 PM Konrad Hinsen <konrad.hinsen@fastmail.net> wrote:
> Hi,
>
> On 09/02/2018 18:13, Ludovic Courtès wrote:
>
> > Amirouche Boubekki <amirouche@hypermove.net> wrote:
> >
> >> tl;dr: Distribution of data and software seems similar.
> >>        Data is more and more important in software and reproducible
> >>        science. Data science ecosystem lacks resource sharing.
> >>        I think guix can help.
> >
> > Now, whether Guix is the right tool to distribute data, I don’t know.
> > Distributing large amounts of data is a job in itself, and the store
> > isn’t designed for that.  It could quickly become a bottleneck.  That’s
> > one of the reasons why the Guix Workflow Language (GWL) does not store
> > scientific data in the store itself.

> and then distributed via standard channels (Zenodo, ...)

Thanks for the pointer!

> For big datasets, some other mechanism is required.

Big as in bigger than RAM?

> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and
> data is code. Building a package is a computation like any other.

What I was thinking about is using Guix to distribute data packages, just as
we distribute software from PyPI. The advantage of using Guix seems obvious,
but apparently it's not desirable or possible, and I don't understand why.

> Scientific workflows could be handled by a specific build system. In
> fact, as long as no big datasets or multiple processors are involved, we
> can do this right now, using standard package declarations.

Ok, good to know.

> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git.

Thanks again for the pointer.
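
To check my understanding of the git-annex analogy, here is a minimal Python sketch. It is not how Guix or git-annex actually work internally. The assumption is that the package (or the repository) records only a small pointer made of a URL, a SHA-256 hash and a size, while the dataset itself lives elsewhere and is verified against the pointer after it is fetched. The names `make_pointer` and `verify`, the file name and the example.org URL are all made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def make_pointer(data_path: Path, url: str) -> dict:
    """Describe a dataset by content hash and size instead of copying it."""
    digest = hashlib.sha256()
    with data_path.open("rb") as f:
        # Read in 1 MiB chunks so the file never has to fit in RAM.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "url": url,  # hypothetical location of the real data
        "sha256": digest.hexdigest(),
        "size": data_path.stat().st_size,
    }


def verify(data_path: Path, pointer: dict) -> bool:
    """Return True if a fetched file matches the recorded hash and size."""
    actual = make_pointer(data_path, pointer["url"])
    return (actual["sha256"] == pointer["sha256"]
            and actual["size"] == pointer["size"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / "measurements.csv"
        dataset.write_bytes(b"t,value\n0,1.0\n1,1.5\n")

        # Only this small pointer would be versioned or packaged.
        pointer = make_pointer(dataset, "https://example.org/measurements.csv")
        print(json.dumps(pointer, indent=2))

        print("intact copy verifies:", verify(dataset, pointer))
        dataset.write_bytes(b"t,value\n0,9.9\n")
        print("modified copy verifies:", verify(dataset, pointer))
```

The point of the sketch is only that the thing being distributed stays small, while reproducibility comes from checking the hash wherever the data ends up being stored.
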

> And for parallel computing, we could have special build daemons.

That's where OWL comes in?