* Use guix to distribute data & reproducible (data) science
@ 2018-02-09 16:32 Amirouche Boubekki
2018-02-09 17:13 ` Ludovic Courtès
0 siblings, 1 reply; 23+ messages in thread
From: Amirouche Boubekki @ 2018-02-09 16:32 UTC (permalink / raw)
To: Guix Devel
Hello all,
tl;dr: Distribution of data and software seems similar.
Data is increasingly important in software and reproducible
science. The data science ecosystem lacks resource sharing.
I think Guix can help.
Recently I stumbled upon the open data movement and its links with
data science.
To give a high-level overview, there are several (web) platforms
that allow administrations and companies to publish data and
_distribute_ it. Examples of such platforms are data.gouv.fr [1] and
various others based on CKAN [2].
[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/
I have worked with data.gouv.fr in particular, and the repository
is rather poor in terms of quality, making it very difficult to use.
The other side of this open data and data-based software is that
some software provides its own mechanism to _distribute_ data or
binary blobs called 'models' that are sometimes based on libre data.
Examples of such software are spaCy [3], gensim [4], NLTK [5], and
word2vec.
[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/
My last point: it's common knowledge that data wrangling, i.e.,
cleaning and preparing data, is 80% of a data scientist's job.
It's required because data distributors don't do it right; they
don't have the manpower or the knowledge to do it right.
To summarize:
1) Some software and platforms distribute _data_ themselves in some
"closed garden" way. It's not the role of software to distribute
data, especially when that data can be reused in other contexts.
2) Models are binary blobs that you use in the hope that they do what
they are supposed to do. How do you build the model? Is the model
reproducible?
3) Preparing data must be re-done all the time; let's share resources
and do it once.
It seems to me that Guix has all the features required to handle
data and model distribution.
What do people think? Do we already use Guix to distribute data and
models?
Also, it seems good to surf on the AI frenzy ;)
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 16:32 Use guix to distribute data & reproducible (data) science Amirouche Boubekki
@ 2018-02-09 17:13 ` Ludovic Courtès
2018-02-09 17:48 ` zimoun
2018-02-09 19:15 ` Konrad Hinsen
0 siblings, 2 replies; 23+ messages in thread
From: Ludovic Courtès @ 2018-02-09 17:13 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel
Hi!
Amirouche Boubekki <amirouche@hypermove.net> skribis:
> tl;dr: Distribution of data and software seems similar.
> Data is increasingly important in software and reproducible
> science. The data science ecosystem lacks resource sharing.
> I think Guix can help.
I think some of us, especially Guix-HPC folks, are convinced about the
usefulness of Guix as one of the tools in the reproducible science
toolchain (that was one of the themes of my FOSDEM talk). :-)
Now, whether Guix is the right tool to distribute data, I don’t know.
Distributing large amounts of data is a job in itself, and the store
isn’t designed for that. It could quickly become a bottleneck. That’s
one of the reasons why the Guix Workflow Language (GWL) does not store
scientific data in the store itself.
I think data should probably be stored and distributed out-of-band using
appropriate storage mechanisms.
Ludo’.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 17:13 ` Ludovic Courtès
@ 2018-02-09 17:48 ` zimoun
2018-02-09 19:15 ` Konrad Hinsen
1 sibling, 0 replies; 23+ messages in thread
From: zimoun @ 2018-02-09 17:48 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel, Amirouche Boubekki
Dear,
From my understanding, what you are describing is what bioinfo guys
call a workflow:
1- fetch data here and there
2- clean and prepare data
3- compute stuff with these data
4- obtain an answer
and loop several times on several data sets.
The Guix Workflow Language allows one to implement the workflow, i.e.,
all the steps and how they are linked to deal with the data.
And thanks to Guix, reproducibility in terms of software comes almost
for free.
Moreover, if there is some channel mechanism, then there is a way to
share these workflows.
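To give a rough idea, a GWL process looks something like the sketch
below. This is from memory, so the exact field names may differ from
the released GWL (the `data-inputs' field is mentioned later in this
thread); the package and file names are placeholders.

  (define-public clean-samples
    (process
     (name "clean-samples")
     (package-inputs (list coreutils))      ;software comes from Guix
     (data-inputs "/data/raw/samples.csv")  ;data stays outside the store
     (procedure
      '(system* "sort" "-u" "-o" "/data/clean/samples.csv"
                "/data/raw/samples.csv"))))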
I think the tools are there, modulo UI and corner cases. :-)
From my point of view, workflows are missing because of manpower
(Lispy people, etc.).
Lastly, a workflow is not necessarily bit-to-bit reproducible, since
some algorithms use randomness.
Hope that helps.
All the best,
simon
On 9 February 2018 at 18:13, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> Hi!
>
> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>> Data is increasingly important in software and reproducible
>> science. The data science ecosystem lacks resource sharing.
>> I think Guix can help.
>
> I think some of us, especially Guix-HPC folks, are convinced about the
> usefulness of Guix as one of the tools in the reproducible science
> toolchain (that was one of the themes of my FOSDEM talk). :-)
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck. That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.
>
> I think data should probably be stored and distributed out-of-band using
> appropriate storage mechanisms.
>
> Ludo’.
>
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 17:13 ` Ludovic Courtès
2018-02-09 17:48 ` zimoun
@ 2018-02-09 19:15 ` Konrad Hinsen
2018-02-09 23:01 ` zimoun
` (2 more replies)
1 sibling, 3 replies; 23+ messages in thread
From: Konrad Hinsen @ 2018-02-09 19:15 UTC (permalink / raw)
To: guix-devel
Hi,
On 09/02/2018 18:13, Ludovic Courtès wrote:
> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>> Data is increasingly important in software and reproducible
>> science. The data science ecosystem lacks resource sharing.
>> I think Guix can help.
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck. That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.
I'd say it depends on the data and how it is used inside and outside of
a workflow. Some data could very well be stored in the store, and then
distributed via standard channels (Zenodo, ...) after export by "guix
pack". For big datasets, some other mechanism is required.
I think it's worth thinking carefully about how to exploit guix for
reproducible computations. As Lispers know very well, code is data and
data is code. Building a package is a computation like any other.
Scientific workflows could be handled by a specific build system. In
fact, as long as no big datasets or multiple processors are involved, we
can do this right now, using standard package declarations.
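For a small data set, such a package declaration could look roughly
like the following sketch; the name, URL, hash, and license are
placeholders, not a real data set:

  (use-modules (guix packages)
               (guix download)
               (guix build-system trivial)
               ((guix licenses) #:prefix license:))

  (define-public example-dataset
    (package
      (name "example-dataset")
      (version "1.0")
      (source (origin
                (method url-fetch)
                (uri "https://example.org/dataset-1.0.csv") ;placeholder
                (sha256
                 (base32
                  "0000000000000000000000000000000000000000000000000000"))))
      (build-system trivial-build-system)
      (arguments
       '(#:modules ((guix build utils))
         #:builder (begin
                     (use-modules (guix build utils))
                     (let ((out (assoc-ref %outputs "out")))
                       (mkdir-p out)
                       ;; A plain file: no unpacking, no extra tools.
                       (copy-file (assoc-ref %build-inputs "source")
                                  (string-append out "/dataset.csv"))
                       #t))))
      (home-page "https://example.org")
      (synopsis "Example data set declared as a regular package")
      (description "Placeholder showing that a small data set can be
declared and built into the store like any other package.")
      (license license:cc0)))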
It would be nice if big datasets could conceptually be handled in the
same way while being stored elsewhere - a bit like git-annex does for
git. And for parallel computing, we could have special build daemons.
Konrad.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 19:15 ` Konrad Hinsen
@ 2018-02-09 23:01 ` zimoun
2018-02-09 23:17 ` Ricardo Wurmus
2018-02-12 11:46 ` Konrad Hinsen
2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki
2018-02-14 13:06 ` Ludovic Courtès
2 siblings, 2 replies; 23+ messages in thread
From: zimoun @ 2018-02-09 23:01 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel
Hi,
> I'd say it depends on the data and how it is used inside and outside of a
> workflow. Some data could very well be stored in the store, and then
> distributed via standard channels (Zenodo, ...) after export by "guix pack".
> For big datasets, some other mechanism is required.
I am not sure I understand the point.
From my point of view, there are 2 kinds of datasets:
a- the ones which are part of the software, e.g., used to pass the
tests; therefore, they are usually small, though not always;
b- the ones which are applied to the software and somehow are not
in the source repository; they may be big or not.
I do not know if any policy is established in Guix about case a-;
I am not sure it is possible in fact (e.g., including whole-genome
FASTA files to test alignment tools, etc.).
It does not appear to me to be a good idea to try to include
datasets of case b- in the store.
Is that not the job of data management tools, e.g., databases, etc.?
I do not know much, but an idea would be to write a workflow: you
fetch the data, you clean it, and you check by hashing that the
result is the expected one. Only the software used to do that is in
the store. The input and output data are not, but your workflow
checks that they are the expected ones.
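For the hashing check, something along these lines could work (a
minimal sketch; `file-sha256' and `bytevector->nix-base32-string' come
from Guix's (guix hash) and (guix base32) modules, and the expected
hash would be recorded when the data is first cleaned):

  (use-modules (guix hash)
               (guix base32))

  ;; Return #t if FILE's content hash matches the expected
  ;; nix-base32 string.
  (define (data-ok? file expected)
    (string=? (bytevector->nix-base32-string (file-sha256 file))
              expected))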
However, it depends on what we are calling 'cleaning' because some
algorithms are not deterministic.
Hum? I do not know if there is some mechanism in GWL to check the hash
of the `data-inputs' field.
> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and data
> is code. Building a package is a computation like any other. Scientific
> workflows could be handled by a specific build system. In fact, as long as
> no big datasets or multiple processors are involved, we can do this right
> now, using standard package declarations.
As a complement to these points ---and personally, I learned some
things about the design of GWL there--- see this thread:
https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html
> It would be nice if big datasets could conceptually be handled in the same
> way while being stored elsewhere - a bit like git-annex does for git. And
> for parallel computing, we could have special build daemons.
Hum? The point is to add data management à la git-annex to GWL, is
that it?
Have a nice weekend!
simon
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 23:01 ` zimoun
@ 2018-02-09 23:17 ` Ricardo Wurmus
2018-02-12 11:46 ` Konrad Hinsen
1 sibling, 0 replies; 23+ messages in thread
From: Ricardo Wurmus @ 2018-02-09 23:17 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel
zimoun <zimon.toutoune@gmail.com> writes:
> I do not know much, but an idea would be to write a workflow: you
> fetch the data, you clean it, and you check by hashing that the
> result is the expected one. Only the software used to do that is in
> the store. The input and output data are not, but your workflow
> checks that they are the expected ones.
> However, it depends on what we are calling 'cleaning' because some
> algorithms are not deterministic.
>
> Hum? I do not know if there is some mechanism in GWL to check the hash
> of the `data-inputs' field.
In the GWL the data-inputs field is not special as far as any of the
current execution engines are concerned. It’s up to the execution
engine to implement recency checks or identity checks, as there is no
one size that fits all inputs.
--
Ricardo
GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 23:01 ` zimoun
2018-02-09 23:17 ` Ricardo Wurmus
@ 2018-02-12 11:46 ` Konrad Hinsen
2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial
1 sibling, 1 reply; 23+ messages in thread
From: Konrad Hinsen @ 2018-02-12 11:46 UTC (permalink / raw)
To: Guix Devel
Hi everyone,
zimoun <zimon.toutoune@gmail.com> writes:
> From my point of view, there are 2 kinds of datasets:
> a- the ones which are part of the software, e.g., used to pass the
> tests; therefore, they are usually small, though not always;
> b- the ones which are applied to the software and somehow are not
> in the source repository; they may be big or not.
I was thinking of the second kind only.
> It does not appear to me to be a good idea to try to include
> datasets of case b- in the store.
> Is that not the job of data management tools, e.g., databases, etc.?
It depends. The frontier between data and code is not as clear as it may
seem. An example: the weights of a trained neural network can be seen as
data (a bunch of numbers), but also as code for a special-purpose
processor defined by the neural network.
Starting from that example, consider that the weights of a neural
network are not fundamentally different from fit parameters in other
scientific models. For example the positions of the atoms in a protein
structure. Using the same analogy as for the neural network, these
positions are the code for a special-purpose processor that computes
estimations for the Bragg reflections that are measured in protein
crystallography.
On the other hand, any sufficiently small dataset can be replaced by a
piece of code that defines a literal data structure. Does that turn the
data into code or not?
My point is that trying to define a distinction between data and code is
always arbitrary and rarely of practical interest. I prefer to take a
pragmatic stance and ask the question: what are the advantages and
problems associated with managing some piece of data in the store? And I
suspect that exploring this question for a couple of applications will
lead to new ways to profit from the store.
However, distribution of data is an entirely different question from
managing data on a single machine. I have no idea how well suited the
store is for distributing data (Hydra does exactly that, right?), so
I'll happily listen to the Guix experts.
> As a complement to these points ---and personally, I learned some
> things about the design of GWL there--- see this thread:
> https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html
Thanks for the pointer!
>> It would be nice if big datasets could conceptually be handled in the same
>> way while being stored elsewhere - a bit like git-annex does for git. And
>> for parallel computing, we could have special build daemons.
>
> Hum? The point is to add data management à la git-annex to GWL, is that it?
At least consider it - I don't know where that will lead.
Amirouche Boubekki <amirouche.boubekki@gmail.com> writes:
>> For big datasets, some other mechanism is required.
>
> Big as in bigger than ram?
Bigger than one is willing to handle as files that are copied between
machines for distribution. For me, personally and at this moment in
time, that's somewhere between 1 GB and 10 GB.
> What I was thinking about is using Guix to distribute data packages
> just like we distribute software from PyPI. The advantage of using
> Guix seems obvious, but apparently it's not desirable or possible,
> and I don't understand why.
I think there are two questions:
1. Can guix and its current infrastructure be used for data
distribution?
2. Can a yet-to-be-designed data distribution infrastructure
use the guix store as its storage backend?
My understanding from Ludo's comment is that 1) may not be a good idea,
but that still leaves 2) as a topic for exploration.
> And for parallel computing, we could have special build daemons.
>
> That's where GWL comes in?
Exactly.
Konrad.
* Do you use packages in Guix to run neural networks?
2018-02-12 11:46 ` Konrad Hinsen
@ 2018-02-14 4:43 ` Fis Trivial
2018-02-14 6:07 ` Pjotr Prins
2018-02-14 8:04 ` Konrad Hinsen
0 siblings, 2 replies; 23+ messages in thread
From: Fis Trivial @ 2018-02-14 4:43 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel
>
> It depends. The frontier between data and code is not as clear as it may
> seem. An example: the weights of a trained neural network can be seen as
> data (a bunch of numbers), but also as code for a special-purpose
> processor defined by the neural network.
>
> Starting from that example, consider that the weights of a neural
> network are not fundamentally different from fit parameters in other
> scientific models. For example the positions of the atoms in a protein
> structure. Using the same analogy as for the neural network, these
> positions are the code for a special-purpose processor that computes
> estimations for the Bragg reflections that are measured in protein
> crystallography.
>
Sorry for bothering with a completely unrelated topic.
I'm curious: do you train neural networks with packages in Guix? Or
did you package the related libraries yourself?
I would love to let Guix handle all machine learning libraries for me,
but many of them, especially the NN libraries, require a GPU to
accelerate computation. But, AFAIK, there is currently no GPU compute
stack that can meet GNU's standards. Being unable to put them upstream
somehow makes me too lazy to package them.
Thanks.
* Re: Do you use packages in Guix to run neural networks?
2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial
@ 2018-02-14 6:07 ` Pjotr Prins
2018-02-14 7:27 ` Fis Trivial
2018-02-14 8:04 ` Konrad Hinsen
1 sibling, 1 reply; 23+ messages in thread
From: Pjotr Prins @ 2018-02-14 6:07 UTC (permalink / raw)
To: Fis Trivial; +Cc: Guix Devel
On Wed, Feb 14, 2018 at 04:43:50AM +0000, Fis Trivial wrote:
> Sorry for bothering with a completely unrelated topic.
> I'm curious: do you train neural networks with packages in Guix? Or
> did you package the related libraries yourself?
>
> I would love to let Guix handle all machine learning libraries for me,
> but many of them, especially the NN libraries, require a GPU to
> accelerate computation. But, AFAIK, there is currently no GPU compute
> stack that can meet GNU's standards. Being unable to put them upstream
> somehow makes me too lazy to package them.
Dennis did some work packaging opencl and arrayfire. It is in here:
https://gitlab.com/genenetwork/guix-bioinformatics/tree/master/gn/packages
and may be a reasonable starting point.
For the proprietary stuff, it won't make it into trunk, but you can
still package it and share it.
Pj.
--
* Re: Do you use packages in Guix to run neural networks?
2018-02-14 6:07 ` Pjotr Prins
@ 2018-02-14 7:27 ` Fis Trivial
0 siblings, 0 replies; 23+ messages in thread
From: Fis Trivial @ 2018-02-14 7:27 UTC (permalink / raw)
To: Pjotr Prins; +Cc: Guix Devel
> On Wed, Feb 14, 2018 at 04:43:50AM +0000, Fis Trivial wrote:
>> Sorry for bothering with a completely unrelated topic.
>> I'm curious: do you train neural networks with packages in Guix? Or
>> did you package the related libraries yourself?
>>
>> I would love to let Guix handle all machine learning libraries for me,
>> but many of them, especially the NN libraries, require a GPU to
>> accelerate computation. But, AFAIK, there is currently no GPU compute
>> stack that can meet GNU's standards. Being unable to put them upstream
>> somehow makes me too lazy to package them.
>
> Dennis did some work packaging opencl and arrayfire. It is in here:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/tree/master/gn/packages
>
> and may be a reasonable starting point.
>
> For the proprietary stuff, it won't make it into trunk, but you can
> still package it and share it.
>
> Pj.
Thanks. It will take some time to work things out, but this is a starting point.
* Re: Do you use packages in Guix to run neural networks?
2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial
2018-02-14 6:07 ` Pjotr Prins
@ 2018-02-14 8:04 ` Konrad Hinsen
1 sibling, 0 replies; 23+ messages in thread
From: Konrad Hinsen @ 2018-02-14 8:04 UTC (permalink / raw)
To: Fis Trivial, Guix Devel
On 14/02/2018 05:43, Fis Trivial wrote:
> Sorry for bothering with a completely unrelated topic.
> I'm curious: do you train neural networks with packages in Guix? Or
> did you package the related libraries yourself?
My needs are modest; all I use is the multilayer perceptron from
scikit-learn, which is already packaged in Guix.
Konrad.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 19:15 ` Konrad Hinsen
2018-02-09 23:01 ` zimoun
@ 2018-02-10 9:51 ` Amirouche Boubekki
2018-02-10 11:28 ` zimoun
2018-02-14 13:06 ` Ludovic Courtès
2 siblings, 1 reply; 23+ messages in thread
From: Amirouche Boubekki @ 2018-02-10 9:51 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: guix-devel
On Fri, Feb 9, 2018 at 8:16 PM Konrad Hinsen <konrad.hinsen@fastmail.net>
wrote:
> Hi,
>
> On 09/02/2018 18:13, Ludovic Courtès wrote:
>
> > Amirouche Boubekki <amirouche@hypermove.net> skribis:
> >
> >> tl;dr: Distribution of data and software seems similar.
> >> Data is increasingly important in software and reproducible
> >> science. The data science ecosystem lacks resource sharing.
> >> I think Guix can help.
> >
> > Now, whether Guix is the right tool to distribute data, I don’t know.
> > Distributing large amounts of data is a job in itself, and the store
> > isn’t designed for that. It could quickly become a bottleneck. That’s
> > one of the reasons why the Guix Workflow Language (GWL) does not store
> > scientific data in the store itself.
>
> and then distributed via standard channels (Zenodo, ...)
Thanks for the pointer!
> For big datasets, some other mechanism is required.
>
Big as in bigger than RAM?
> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and
> data is code. Building a package is a computation like any other.
>
What I was thinking about is using Guix to distribute data packages
just like we distribute software from PyPI. The advantage of using
Guix seems obvious, but apparently it's not desirable or possible,
and I don't understand why.
> Scientific workflows could be handled by a specific build system. In
> fact, as long as no big datasets or multiple processors are involved, we
> can do this right now, using standard package declarations.
Ok, good to know.
> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git.
Thanks again for the pointer.
> And for parallel computing, we could have special build daemons.
That's where GWL comes in?
* Re: Use guix to distribute data & reproducible (data) science
2018-02-09 19:15 ` Konrad Hinsen
2018-02-09 23:01 ` zimoun
2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki
@ 2018-02-14 13:06 ` Ludovic Courtès
2018-02-15 17:10 ` zimoun
2 siblings, 1 reply; 23+ messages in thread
From: Ludovic Courtès @ 2018-02-14 13:06 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: guix-devel
Hello,
Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git. And for parallel computing, we could have special build daemons.
Exactly. I think we need a git-annex/git-lfs-like tool for the store.
(It could also be useful for things like secrets, which we don’t want to
have in the store.)
Ludo’.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-14 13:06 ` Ludovic Courtès
@ 2018-02-15 17:10 ` zimoun
2018-02-16 9:28 ` Konrad Hinsen
2018-02-16 12:41 ` Amirouche Boubekki
0 siblings, 2 replies; 23+ messages in thread
From: zimoun @ 2018-02-15 17:10 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Hi,
Thank you for this food for thought.
I agree that the frontier between code and data is arbitrary.
However, I am not sure I get the picture about data management in
the context of Reproducible Science. What is the issue?
So, I take up your invitation to explore your idea. :-)
Let's think about the old lab experiment. On the one hand, you have
your protocol and the description of all the steps. On the other hand,
you have measurements and results. Then, I can imagine a sense in
which some bit-to-bit mechanism applies to the protocol part. I am not
sure about the measurements part.
Well, the protocol is code or workflow; measurements are data.
And I agree that, e.g., information on electronic orbits or the
weights of a trained neural network is sometimes part of the
protocol. :-)
For me, just talking about code, it is not a straightforward task to
define the properties of a reproducible and fully controlled
computational environment. It is --I guess-- what Guix is defining
(transactional, user profiles, hackable, etc.). Then, it appears to me
even more difficult for data.
What are such properties for data management?
In other words, on paper, what are the benefits of managing some
piece of data in the store? For example, for the weights of a trained
neural network, or the positions of the atoms in a protein structure.
For me --maybe I am wrong-- the way is to define a package (or
workflow) that fetches the data from some external source, cleans it
if needed, does some checks, and then puts it at /path/to/somewhere/
outside the store. In parallel computing, this /path/to/somewhere/ is
accessible by all the nodes. Moreover, this /path/to/somewhere/
contains something hash-based in the folder name.
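To illustrate that last point, the folder name could be derived from
the content hash, in the spirit of /gnu/store naming; this helper is
hypothetical, not a Guix API:

  (use-modules (guix hash)
               (guix base32))

  ;; Hypothetical: compute a store-like directory name for a cleaned
  ;; data set kept outside the store.
  (define (data-directory file)
    (string-append "/path/to/somewhere/"
                   (bytevector->nix-base32-string (file-sha256 file))
                   "-" (basename file)))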
Is it not enough?
Why do you need the history of changes, as git provides?
Secrets are another story than the reproducible science toolchain, I
guess.
Thank you again.
All the best,
simon
* Re: Use guix to distribute data & reproducible (data) science
2018-02-15 17:10 ` zimoun
@ 2018-02-16 9:28 ` Konrad Hinsen
2018-02-16 14:33 ` myglc2
2018-02-16 12:41 ` Amirouche Boubekki
1 sibling, 1 reply; 23+ messages in thread
From: Konrad Hinsen @ 2018-02-16 9:28 UTC (permalink / raw)
To: zimoun, Guix Devel
Hi,
> In other words, on paper, what are the benefits of managing some
> piece of data in the store? For example, for the weights of a trained
> neural network, or the positions of the atoms in a protein structure.
Provenance tracking. In a complex data processing workflow, it is
important to know which computations were done in which order using
which software. This is technically almost the same as software
dependency tracking, so it would be nice to re-use the Guix
infrastructure for this.
> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it
> if needed, does some checks, and then puts it at /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/
> contains something hash-based in the folder name.
>
> Is it not enough?
Whether for software or for data, dependencies are DAGs whose terminal
nodes are measurements (for data) or human-supplied information (code,
parameters, methodological choices). Guix handles the latter very well.
The three missing pieces are:
- Dealing with measurements, which might involve interacting with
experimental equipment or databases. Moreover, since data from
such sources can change, its hash in the store must be computed
from the contents, not just from the reference to the contents.
- Dealing with data that is much larger than what Guix handles
well in the store.
- Exporting results with provenance tracking to the outside world,
which may not be using Guix. Big data aside, this could take the
form of a new output format to "guix pack", for example a
research object (http://www.researchobject.org/) or a Reproducible
Document Archive (https://github.com/substance/dar).
> Why do you need the history of changes, as git provides?
I'd say git is fine for everything not "too big".
> Secrets are another story than the reproducible science toolchain, I guess.
Yes, indeed. And not something I need to deal with, so I will shut up
now!
Konrad.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-16 9:28 ` Konrad Hinsen
@ 2018-02-16 14:33 ` myglc2
2018-02-16 15:20 ` Konrad Hinsen
0 siblings, 1 reply; 23+ messages in thread
From: myglc2 @ 2018-02-16 14:33 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel
Hi Konrad,
On 02/16/2018 at 10:28 Konrad Hinsen writes:
> Whether for software or for data, dependencies are DAGs whose terminal
> nodes are measurements (for data) or human-supplied information (code,
> parameters, methodological choices). Guix handles the latter very well.
>
> The three missing pieces are:
>
> - Dealing with measurements, which might involve interacting with
> experimental equipment or databases. Moreover, since data from
> such sources can change, its hash in the store must be computed
> from the contents, not just from the reference to the contents.
Why not "enclose" a measurement set and it's provenance in a git
"package"?
- George
* Re: Use guix to distribute data & reproducible (data) science
2018-02-16 14:33 ` myglc2
@ 2018-02-16 15:20 ` Konrad Hinsen
0 siblings, 0 replies; 23+ messages in thread
From: Konrad Hinsen @ 2018-02-16 15:20 UTC (permalink / raw)
To: myglc2, Guix Devel
Hi George,
myglc2@gmail.com writes:
>> The three missing pieces are:
>>
>> - Dealing with measurements, which might involve interacting with
>> experimental equipment or databases. Moreover, since data from
>> such sources can change, its hash in the store must be computed
>> from the contents, not just from the reference to the contents.
>
> Why not "enclose" a measurement set and it's provenance in a git
> "package"?
For text-based data of reasonable size, that's an option. Many people
are already using git for data management. But then, data is so diverse
that no rule and no tool will satisfy everybody's requirements.
Konrad.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-15 17:10 ` zimoun
2018-02-16 9:28 ` Konrad Hinsen
@ 2018-02-16 12:41 ` Amirouche Boubekki
1 sibling, 0 replies; 23+ messages in thread
From: Amirouche Boubekki @ 2018-02-16 12:41 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel
On Thu, Feb 15, 2018 at 6:11 PM zimoun <zimon.toutoune@gmail.com> wrote:
> Hi,
>
> Thank you for this food for thought.
>
>
> I agree that the frontier between code and data is arbitrary.
>
> However, I am not sure I get the picture about data management in
> the context of Reproducible Science. What is the issue?
>
> So, I take up your invitation to explore your idea. :-)
>
[...]
> For me, just talking about code, it is not a straightforward task to
> define the properties of a reproducible and fully controlled
> computational environment. It is --I guess-- what Guix is defining
> (transactional, user profiles, hackable, etc.). Then, it appears to me
> even more difficult for data.
>
> What are such properties for data management?
>
> In other words, on paper, what are the benefits of managing some
> piece of data in the store? For example, for the weights of a trained
> neural network, or the positions of the atoms in a protein structure.
>
Given versioned datasets, you could want to switch the input dataset
of a given "pipeline" to see how different data produce different
results.
Also, it is desirable to be able to re-start a "pipeline" when a
dataset is updated.
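For instance, if data sets were declared as packages (reusing the
hypothetical `example-dataset' sketch from earlier in this thread),
switching versions would just mean pointing the pipeline at another
variable:

  ;; Sketch: a second version of the hypothetical data package; a
  ;; pipeline could take either `example-dataset' or this one as input.
  (define-public example-dataset-2018
    (package
      (inherit example-dataset)
      (version "2018")
      (source (origin
                (method url-fetch)
                (uri "https://example.org/dataset-2018.csv") ;placeholder
                (sha256
                 (base32
                  "0000000000000000000000000000000000000000000000000000"))))))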
> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it
> if needed, does some checks, and then puts it at /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/
> contains something hash-based in the folder name.
>
> Is it not enough?
>
It is not enough: you may need to diff two datasets, which is not
easily done if the data is stored in tarballs. But that is not a case
that can be handled by Guix.
> Why do you need the history of changes, as git provides?
Because, if the dataset introduces a change that is not handled by
the rest of the code, you can find out by looking at the diff. For
instance, a column that is an enumeration of three values that now has
a fourth. But again, it's not a case that's meant to be handled by
Guix.
As others have said, there are different kinds of data and - even if
it were possible to handle large datasets in the Guix store - it would
also require a lot of space and computational power. ConceptNet 5.5.5
is 10G, takes more than a dozen hours to build, and AFAIK is not
reproducible since it takes its input directly from a live instance of
Wiktionary. WikiData is 100G but requires no processing power. Those
are structured data that you could want to version in something like
git. But things like spaCy models <https://spacy.io/models/en>, which
are around 1G and take, I guess, a few hours to build, are not
structured. Those are the data that I know about, and there are very
few of them compared to small datasets (see data.gouv.fr).
I think there are various opportunities around reproducible data
science. In particular, I see two main ones:
a) Packaging data and working with upstream to keep the data clean,
work that is already handled by private actors and by initiatives like
http://datahub.io/
b) Cooperating around the making of datasets, some kind of *git for
data*, for which there are few or no initiatives. FWIW, I started such
a project, which I call neon <http://www.hyperdev.fr/projects/neon/>.
Thanks for the feedback.
* Re: Use guix to distribute data & reproducible (data) science
@ 2018-02-16 16:43 Amirouche Boubekki
2018-02-17 22:21 ` Roel Janssen
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Amirouche Boubekki @ 2018-02-16 16:43 UTC (permalink / raw)
To: ludovic.courtes; +Cc: Guix Devel
Hello again Ludovic,
On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
> Hi!
>
> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>>> Data is increasingly important in software and reproducible
>>> science. The data science ecosystem lacks resource sharing.
>>> I think Guix can help.
>
> I think some of us, especially Guix-HPC folks, are convinced about the
> usefulness of Guix as one of the tools in the reproducible science
> toolchain (that was one of the themes of my FOSDEM talk). :-)
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck.
What does it mean technically that the store “isn't designed for that”?
> That’s one of the reasons why the Guix Workflow Language (GWL)
> does not store scientific data in the store itself.
Sorry, I did not follow the engineering discussion around GWL.
Searching the web brings me to [0]. That said, the question I am
asking is not answered there; in particular, there is no rationale
for that in the design paper.
> I think data should probably be stored and distributed out-of-band
> using
> appropriate storage mechanisms.
Then, in a follow-up mail, you reply to Konrad:
>> Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
>
> [...]
>
>> It would be nice if big datasets could conceptually be handled in the
>> same way while being stored elsewhere - a bit like git-annex does for
>> git. And for parallel computing, we could have special build daemons.
>
> Exactly. I think we need a git-annex/git-lfs-like tool for the store.
> (It could also be useful for things like secrets, which we don’t want
> to have in the store.)
>
-
" The most basic of all human needs is the need to understand and be
understood " Ralph Nichols
* Re: Use guix to distribute data & reproducible (data) science
2018-02-16 16:43 Amirouche Boubekki
@ 2018-02-17 22:21 ` Roel Janssen
2018-02-18 23:42 ` Ludovic Courtès
2018-02-19 7:57 ` Ricardo Wurmus
2 siblings, 0 replies; 23+ messages in thread
From: Roel Janssen @ 2018-02-17 22:21 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel, ludovic.courtes
Amirouche Boubekki writes:
> Hello again Ludovic,
>
> On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
>> Hi!
>>
>> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>>
>>> tl;dr: Distribution of data and software seems similar.
>>> Data is increasingly important in software and reproducible
>>> science. The data science ecosystem lacks resource sharing.
>>> I think Guix can help.
>>
>> I think some of us, especially Guix-HPC folks, are convinced about the
>> usefulness of Guix as one of the tools in the reproducible science
>> toolchain (that was one of the themes of my FOSDEM talk). :-)
>>
>> Now, whether Guix is the right tool to distribute data, I don’t know.
>> Distributing large amounts of data is a job in itself, and the store
>> isn’t designed for that. It could quickly become a bottleneck.
>
> What does it mean technically that the store “isn't designed for that”?
>
>> That’s one of the reasons why the Guix Workflow Language (GWL)
>> does not store scientific data in the store itself.
>
> Sorry, I did not follow the engineering discussion around GWL.
> Searching the web brings me to [0]. That said, the question I am
> asking is not answered there; in particular, there is no rationale
> for that in the design paper.
>
> [0] http://lists.gnu.org/archive/html/guix-devel/2016-10/msg01248.html
>
>> I think data should probably be stored and distributed out-of-band
>> using
>> appropriate storage mechanisms.
>
> Then, in a follow-up mail, you reply to Konrad:
>
>>> Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
>>
>> [...]
>>
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>>
>> Exactly. I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)
>>
To answer your question:
> What does it mean technically that the store “isn't designed for that”?
I speak only from my own experience with “big data sets”, so maybe it
is different for other people, but we use a separate storage system for
storing large amounts of data. This separate storage is fault-tolerant
and is optimized for large files, meaning higher latency for file access
to reduce the financial footprint of such a system.
If we were to put data inside the store, we would need to optimize the
storage system for both low latency for small files, and a high storage
capacity. This is extremely expensive.
Another issue I faced when providing datasets in the store is that
it's quite easy to end up with duplicated copies of the same dataset.
For example, I use the GNU build system for extracting a tarball that
contains a couple of files. Whenever a package that affects the
GNU build system changes, the data package will be rebuilt.
So you could use the trivial build system, but then I'd still need tar
and gzip to unpack the tarball. Any change to these and the datasets
get duplicated. This is not ideal.
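To make the problem concrete, here is a sketch of such a package
(reusing the hypothetical `example-dataset' from earlier in this
thread; `tar' is from (gnu packages base), `gzip' from (gnu packages
compression), and the URL and hash are placeholders). Because tar and
gzip are inputs of the derivation, any change to either one changes
the dataset's store path:

  (define-public example-dataset-unpacked
    (package
      (inherit example-dataset)
      (name "example-dataset-unpacked")
      (source (origin
                (method url-fetch)
                (uri "https://example.org/dataset-1.0.tar.gz")
                (sha256
                 (base32
                  "0000000000000000000000000000000000000000000000000000"))))
      (native-inputs `(("tar" ,tar)
                       ("gzip" ,gzip)))
      (arguments
       '(#:modules ((guix build utils))
         #:builder
         (begin
           (use-modules (guix build utils))
           (let ((out  (assoc-ref %outputs "out"))
                 (tar  (assoc-ref %build-inputs "tar"))
                 (gzip (assoc-ref %build-inputs "gzip")))
             (mkdir-p out)
             ;; tar invokes gzip to decompress, so both tools end up
             ;; in this derivation's inputs.
             (setenv "PATH" (string-append gzip "/bin"))
             (with-directory-excursion out
               (zero? (system* (string-append tar "/bin/tar") "xf"
                               (assoc-ref %build-inputs "source"))))))))))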
Kind regards,
Roel Janssen
* Re: Use guix to distribute data & reproducible (data) science
2018-02-16 16:43 Amirouche Boubekki
2018-02-17 22:21 ` Roel Janssen
@ 2018-02-18 23:42 ` Ludovic Courtès
2018-02-19 7:57 ` Ricardo Wurmus
2 siblings, 0 replies; 23+ messages in thread
From: Ludovic Courtès @ 2018-02-18 23:42 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel
Hi Amirouche,
Amirouche Boubekki <amirouche@hypermove.net> skribis:
> On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
>> Hi!
>>
>> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>>
>>> tl;dr: Distribution of data and software seems similar.
>>> Data is increasingly important in software and reproducible
>>> science. The data science ecosystem lacks resource sharing.
>>> I think Guix can help.
>>
>> I think some of us, especially Guix-HPC folks, are convinced about the
>> usefulness of Guix as one of the tools in the reproducible science
>> toolchain (that was one of the themes of my FOSDEM talk). :-)
>>
>> Now, whether Guix is the right tool to distribute data, I don’t know.
>> Distributing large amounts of data is a job in itself, and the store
>> isn’t designed for that. It could quickly become a bottleneck.
>
> What does it mean technically that the store “isn't designed for that”?
There are several potential issues. One is GC: how convenient is it to
have big datasets subject to GC? Another one is the I/O bottleneck:
when adding a file to the store, you currently do an ‘add-to-store’ RPC
to the daemon and pass it the file name; the daemon then reads the file
entirely to compute its content hash, which could be an issue with big
datasets.
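For reference, that RPC looks like this from the client side (a
sketch; the file name is a placeholder):

  (use-modules (guix store))

  ;; The daemon re-reads the whole file to compute its content hash,
  ;; which is where the I/O cost comes from for big data sets.
  (with-store store
    (add-to-store store "dataset.csv"
                  #f            ;recursive? (#f: a single flat file)
                  "sha256"      ;hash algorithm
                  "/data/raw/dataset.csv"))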
HTH,
Ludo’.
* Re: Use guix to distribute data & reproducible (data) science
2018-02-16 16:43 Amirouche Boubekki
2018-02-17 22:21 ` Roel Janssen
2018-02-18 23:42 ` Ludovic Courtès
@ 2018-02-19 7:57 ` Ricardo Wurmus
2 siblings, 0 replies; 23+ messages in thread
From: Ricardo Wurmus @ 2018-02-19 7:57 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel, ludovic.courtes
Amirouche Boubekki <amirouche@hypermove.net> writes:
> Then, in a follow-up mail, you reply to Konrad:
>
>>> Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
>>
>> [...]
>>
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>>
>> Exactly. I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)
In addition to the answers by Ludo and Roel, I’d like to add that for
data we have more things that we’d like to know about. For any given
dataset on storage I’d like to know how it relates to previous versions
of the same dataset. The hash alone would not be sufficient. I’d
actually need to know which dataset is the parent and which is a child.
The store does not give me relations like that when given two or more
items. The store retains information about links between items in one
generation (if they embed such references), but not across generations.
I think the requirements for the storage and retrieval of (big) datasets
are very different to those of software packages.
There are projects dedicated to dataset storage, such as Pachyderm.io.
Since data storage is just a stepping stone to better workflows,
Pachyderm also includes support for application bundles, but it may be
better to let a dedicated workflow language take care of the application
side.
Maybe the GWL can be integrated with dedicated data storage solutions
like Pachyderm.
--
Ricardo
GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net