From mboxrd@z Thu Jan  1 00:00:00 1970
From: Amirouche Boubekki <amirouche.boubekki@gmail.com>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Sat, 10 Feb 2018 09:51:41 +0000
Message-ID: <CAL7_Mo9kpmuec8krj8SyDK3NciKXj+46MLC==uVSC7a-5GZJAA@mail.gmail.com>
References: <365e13248634ac1e26cf6678611d550d@hypermove.net>
	<87mv0ixf07.fsf@gnu.org>
	<1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="001a113543163251710564d898f8"
Return-path: <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33147)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <amirouche.boubekki@gmail.com>) id 1ekRp5-0000td-Av
	for guix-devel@gnu.org; Sat, 10 Feb 2018 04:51:56 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <amirouche.boubekki@gmail.com>) id 1ekRp4-0003Or-40
	for guix-devel@gnu.org; Sat, 10 Feb 2018 04:51:55 -0500
Received: from mail-ot0-x22a.google.com ([2607:f8b0:4003:c0f::22a]:44796)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <amirouche.boubekki@gmail.com>)
	id 1ekRp3-0003Np-SF
	for guix-devel@gnu.org; Sat, 10 Feb 2018 04:51:54 -0500
Received: by mail-ot0-x22a.google.com with SMTP id l5so10004003otj.11
	for <guix-devel@gnu.org>; Sat, 10 Feb 2018 01:51:53 -0800 (PST)
In-Reply-To: <1cb709d0-b282-192c-ce1d-20fbff43430e@fastmail.net>
List-Id: "Development of GNU Guix and the GNU System distribution."
	<guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guix-devel/>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=subscribe>
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
To: Konrad Hinsen <konrad.hinsen@fastmail.net>
Cc: guix-devel@gnu.org

--001a113543163251710564d898f8
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, Feb 9, 2018 at 8:16 PM Konrad Hinsen <konrad.hinsen@fastmail.net>
wrote:

> Hi,
>
> On 09/02/2018 18:13, Ludovic Courtès wrote:
>
> > Amirouche Boubekki <amirouche@hypermove.net> skribis:
> >
> >> tl;dr: Distribution of data and software seems similar.
> >>         Data is more and more important in software and reproducible
>>         science. The data science ecosystem lacks resource sharing.
> >>         I think guix can help.
> >
> > Now, whether Guix is the right tool to distribute data, I don’t know.
> > Distributing large amounts of data is a job in itself, and the store
> > isn’t designed for that.  It could quickly become a bottleneck.  That’s
> > one of the reasons why the Guix Workflow Language (GWL) does not store
> > scientific data in the store itself.
>
> and then distributed via standard channels (Zenodo, ...)


Thanks for the pointer!


> For big datasets, some other mechanism is required.
>

Big as in bigger than RAM?


> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and
> data is code. Building a package is a computation like any other.
>

What I was thinking about is using Guix to distribute data packages, just
like we distribute software from PyPI. The advantage of using Guix seems
obvious, but apparently it's not desirable or not possible, and I don't
understand why.

> Scientific workflows could be handled by a specific build system. In
> fact, as long as no big datasets or multiple processors are involved, we
> can do this right now, using standard package declarations.
>

Ok, good to know.


> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git.


Thanks again for the pointer.
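
To check that I understand the comparison, here is a rough sketch in
Python of the "small pointer, data stored elsewhere" idea. It is not Guix
or git-annex code, and it does not reproduce how git-annex works
internally; DataPointer, make_pointer and verify are names I made up for
illustration. The only assumption is that a dataset can be identified by
its SHA-256 digest and size, and that the file is hashed in chunks so it
does not need to fit in RAM.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataPointer:
    """Small record standing in for a large dataset (hypothetical name)."""
    name: str
    sha256: str
    size: int


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file chunk by chunk, so it never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> DataPointer:
    """Build the pointer that would be distributed instead of the data."""
    return DataPointer(path.name, sha256_of(path), path.stat().st_size)


def verify(pointer: DataPointer, path: Path) -> bool:
    """Check that a file fetched from elsewhere matches the pointer."""
    return (path.stat().st_size == pointer.size
            and sha256_of(path) == pointer.sha256)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        dataset = Path(directory) / "measurements.csv"
        dataset.write_bytes(b"t,value\n0,1.5\n1,1.7\n")
        pointer = make_pointer(dataset)
        print(pointer.name, pointer.size, verify(pointer, dataset))
```

The pointer is tiny and could be distributed like any other package
input, while the dataset itself is fetched from wherever it is hosted and
checked against the pointer before use.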

> And for parallel computing, we could have special build daemons.
>

Is that where GWL comes in?

--001a113543163251710564d898f8--