From: Amirouche Boubekki <amirouche.boubekki@gmail.com>
To: zimoun <zimon.toutoune@gmail.com>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 16 Feb 2018 12:41:18 +0000
Message-ID: <CAL7_Mo-CrBFzDF49aexuindEKPfevrC2=rk-0oFX_U2qwt_Ycg@mail.gmail.com>
In-Reply-To: <CAJ3okZ14RAUU4FrBMEOpOLU9aV21BJbY1G5-WeZq3Q-TEmk1Hg@mail.gmail.com>

On Thu, Feb 15, 2018 at 6:11 PM zimoun <zimon.toutoune@gmail.com> wrote:

> Hi,
>
> Thank you for this food for thought.
>
>
> I agree that the frontier between code and data is arbitrary.
>
> However, I am not sure I get the picture regarding data management in
> the context of Reproducible Science. What is the issue?
>
> So, I accept your invitation to explore your idea. :-)
>

[...]


> For me, just talking about code, it is not a straightforward task to
> define the properties of a reproducible and fully controlled
> computational environment. It is --I guess-- what Guix is defining
> (transactional, user-profile, hackable, etc.). For data, it appears to
> me even more difficult.
>


> What are such properties for data management?
>
> In other words, on paper, what are the benefits of managing
> some piece of data in the store? For example, for applications such as
> the weights of a trained neural network, or the positions of the atoms
> in a protein structure.
>

Given versioned datasets, you might want to switch
the input dataset of a given "pipeline" to see how different data
produce different results.

Also, it is desirable to be able to restart a "pipeline" when a
dataset is updated.
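
As a rough illustration of the restart-on-update idea, here is a minimal
Python sketch, not tied to Guix in any way. It identifies a dataset version
by a SHA-256 digest of its content and only re-runs the pipeline when that
digest changes. The names `dataset_digest`, `run_if_changed` and `count_rows`
are hypothetical, and the in-memory byte strings stand in for real dataset
files.

```python
import hashlib


def dataset_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying one version of a dataset."""
    return hashlib.sha256(data).hexdigest()


def run_if_changed(data: bytes, last_digest, pipeline):
    """Run `pipeline` on `data` only when its digest differs from `last_digest`.

    Returns (digest, result); result is None when the run was skipped.
    """
    digest = dataset_digest(data)
    if digest == last_digest:
        return digest, None
    return digest, pipeline(data)


def count_rows(data: bytes) -> int:
    # Hypothetical stand-in for a real pipeline: count non-empty lines.
    return sum(1 for line in data.splitlines() if line.strip())


v1 = b"id,color\n1,red\n2,green\n"
v2 = b"id,color\n1,red\n2,green\n3,blue\n"

digest, result = run_if_changed(v1, None, count_rows)  # first run
_, skipped = run_if_changed(v1, digest, count_rows)    # unchanged: skipped
_, rerun = run_if_changed(v2, digest, count_rows)      # updated: re-run
```

Swapping the input dataset then amounts to passing a different version to
the same pipeline function and comparing the results.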

> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it if
> needed, does some checks, and then puts it in /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/ contains
> something hash-based in the folder name.
>
> Is it not enough?
>

It is not enough: you may need to diff two datasets, which is not
easily done if the data is stored in tarballs.
But that is not a case that Guix can handle.

> Why do you need the history of changes, as git provides?
>

Because if the dataset introduces a change that is not handled by
the rest of the code, you can find out by looking at the diff. For
instance, a column that is an enumeration of three values and now has a
fourth. But again, it's not a case that's meant to be handled by Guix.
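
To make the enumeration example concrete, here is a minimal Python sketch
under stated assumptions: both dataset versions are small CSV documents, and
the column name `status` and its values are invented for illustration. It
reports values that appear in the new version of a column but not in the
old one, which is the kind of change a diff would surface.

```python
import csv
import io


def column_values(csv_text: str, column: str) -> set:
    """Collect the distinct values of `column` in a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def new_enum_values(old_csv: str, new_csv: str, column: str) -> set:
    """Values of `column` present in the new version but not the old one."""
    return column_values(new_csv, column) - column_values(old_csv, column)


# Hypothetical data: the old version has three statuses, the new one four.
old_version = "id,status\n1,open\n2,closed\n3,pending\n"
new_version = "id,status\n1,open\n2,closed\n3,pending\n4,archived\n"

added = new_enum_values(old_version, new_version, "status")
print(sorted(added))  # prints ['archived']
```

A check like this only works when both versions can be read record by
record, which is part of why opaque tarballs make diffing hard.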

Like others have said, there are different kinds of data and, even if it
were possible to handle large datasets in the Guix store, it would also
require a lot of space and computation power. ConceptNet 5.5.5 is 10G,
takes more than a dozen hours to build and AFAIK is not reproducible,
since it takes its input directly from a live instance of Wiktionary.
Wikidata is 100G, but requires no processing power. Those are structured
data that you might want to version in something like git. But things like
spaCy models <https://spacy.io/models/en>, which are around 1G and take,
I guess, a few hours to build, are not structured. Those are the data that
I know about, and there are very few of them compared to small datasets
(see data.gouv.fr).

I think there are various opportunities around reproducible data science.
In particular, I see two main ones:

a) Packaging data and working with upstream to keep the data clean. This
work is already done by private actors and by initiatives like
http://datahub.io/

b) Cooperating around the making of datasets, some kind of *git for data*,
for which there are few or no initiatives. FWIW, I started such a project,
which I call neon <http://www.hyperdev.fr/projects/neon/>.

Thanks for the feedback.
