all messages for Guix-related lists mirrored at yhetil.org
* Use guix to distribute data & reproducible (data) science
@ 2018-02-09 16:32 Amirouche Boubekki
From: Amirouche Boubekki @ 2018-02-09 16:32 UTC (permalink / raw)
  To: Guix Devel

Héllo all,

tl;dr: Distribution of data and software seems similar.
        Data is more and more important in software and reproducible
        science. The data science ecosystem lacks resource sharing.
        I think guix can help.

Recently I stumbled upon the open data movement and its links with
data science.

To give a high-level overview, there are several (web) platforms
that allow administrations and companies to publish data and
_distribute_ it. Examples of such platforms are data.gouv.fr [1] and
various other platforms based on CKAN [2].

[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the quality of
the repository is rather poor, which makes it very difficult to use.

The other side of open data and data-based software is that some
software provides its own mechanism to _distribute_ data or binary
blobs called 'models', which are sometimes based on libre data.
Examples of such software are spacy [3], gensim [4], nltk [5] and
word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it's common knowledge that data wrangling,
a.k.a. cleaning and preparing data, is 80% of a data scientist's job.
It's required because data distributors don't do it right: they
have neither the manpower nor the knowledge to do it right.

To summarize:

1) Some software and platforms distribute _data_ themselves in some
    "closed garden" way. It's not the role of software to distribute
    data especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope they do what they
    are supposed to do. How do you build the model? Is the model
    reproducible?

3) Preparing data has to be redone every time; let's share resources
    and do it once.

It seems to me that guix has all the required features to handle the
distribution of data and models.
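
One of those features is that guix pins every package source to a
content hash and rejects a download that does not match it. The same
idea applies to a data set or a model file. The following is a minimal
Python sketch of that check, not guix code; the function names and the
sample data are hypothetical and only serve the illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, expected_sha256: str) -> bytes:
    """Return `data` unchanged if it matches the pinned hash, else raise.

    This mimics, in a simplified way, refusing an input whose content
    differs from what the recipe recorded.
    """
    actual = sha256_hex(data)
    if actual != expected_sha256:
        raise ValueError(
            f"hash mismatch: expected {expected_sha256}, got {actual}"
        )
    return data


# Hypothetical data set; in practice these bytes would be downloaded.
raw = b"city,population\nParis,2148000\n"

# The hash is recorded once, next to the recipe that uses the data.
pinned = sha256_hex(raw)

# Later consumers only accept data that matches the pinned hash.
checked = verify_pinned(raw, pinned)
```

With such a check, anyone who rebuilds a model from the same recipe
knows they started from byte-identical input data.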

What do people think? Do we already use guix to distribute data and
models?

Also, it seems good to surf on the AI frenzy ;)
