From: Amirouche Boubekki <amirouche@hypermove.net>
To: Guix Devel <guix-devel@gnu.org>
Subject: Use guix to distribute data & reproducible (data) science
Date: Fri, 09 Feb 2018 17:32:32 +0100
Message-ID: <365e13248634ac1e26cf6678611d550d@hypermove.net>

Héllo all,

tl;dr: Distribution of data and software seems similar.
        Data is more and more important in software and reproducible
        science. The data science ecosystem lacks resource sharing.
        I think guix can help.

Recently I stumbled upon the open data movement and its links with
data science.

To give a high level overview, there are several (web) platforms
that allow administrations and companies to publish data and
_distribute_ it. Examples of such platforms are data.gouv.fr [1] and
various other platforms based on CKAN [2].

[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the repository
is rather poor in quality, which makes it very difficult to use.

The other side of open data and data-based software is the
fact that some software provides its own mechanism to _distribute_
data or binary blobs called 'models' that are sometimes based on
libre data. Examples of such software are spacy [3], gensim [4],
nltk [5] and word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it's common knowledge that data wrangling,
a.k.a. cleaning and preparing data, is 80% of a data scientist's job.
It's required because data distributors don't do it right, because
they don't have the manpower and the knowledge to do it right.

To summarize:

1) Some software and platforms distribute _data_ themselves in some
    "walled garden" way. It's not the role of software to distribute
    data, especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope they do what they
    are supposed to do. How do you build the model? Is the model
    reproducible?

3) Preparing data must be re-done all the time; let's share resources
    and do it once.

It seems to me that guix has all the required features to handle data
and model distribution.
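
One feature that seems relevant here: Guix pins every source it
downloads by a cryptographic hash, so a dataset could be pinned the
same way as a tarball. The sketch below is plain Python, not Guix
code. It only illustrates the idea under stated assumptions: the
function names (content_hash, verify, clean) and the toy dataset are
made up for illustration, and the cleaning step stands in for any
deterministic preparation script.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a blob (dataset or model)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, pinned_hash: str) -> bytes:
    """Return data unchanged if it matches the pinned hash, else raise."""
    actual = content_hash(data)
    if actual != pinned_hash:
        raise ValueError(f"hash mismatch: expected {pinned_hash}, got {actual}")
    return data


def clean(raw: bytes) -> bytes:
    """Hypothetical deterministic cleaning step:
    strip surrounding whitespace and drop blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    return b"\n".join(line for line in lines if line) + b"\n"


# Toy dataset (made up). In practice the hash would be recorded once,
# next to the download URL, by whoever packages the data.
raw_dataset = b"city,population\n  Paris,2148000 \n\nLyon,513000\n"
pinned = content_hash(raw_dataset)

verified = verify(raw_dataset, pinned)

# The same cleaning step applied to the same pinned input yields the
# same bytes, so the cleaned result can be identified by its own hash.
first = clean(verified)
second = clean(verified)
print(content_hash(first) == content_hash(second))
```

If the input is pinned and the preparation step is deterministic, the
cleaned data only needs to be produced once and anyone can check that
they got the same thing.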

What do people think? Do we already use guix to distribute data and
models?

Also, it seems good to surf on the AI frenzy ;)
