all messages for Guix-related lists mirrored at yhetil.org
* Re: Use guix to distribute data & reproducible (data) science
@ 2018-02-16 16:43 Amirouche Boubekki
From: Amirouche Boubekki @ 2018-02-16 16:43 UTC
  To: ludovic.courtes; +Cc: Guix Devel

Hello again Ludovic,

On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
> Hi!
> 
> Amirouche Boubekki <amirouche@hypermove.net> skribis:
> 
>> tl;dr: Distribution of data and software seems similar.
>>        Data is more and more important in software and reproducible
>>        science. The data science ecosystem lacks resource sharing.
>>        I think guix can help.
> 
> I think some of us, especially Guix-HPC folks, are convinced about the
> usefulness of Guix as one of the tools in the reproducible science
> toolchain (that was one of the themes of my FOSDEM talk).  :-)
> 
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that.  It could quickly become a bottleneck.

What does it mean technically that the store “isn't designed for that”?

> That’s one of the reasons why the Guix Workflow Language (GWL)
> does not store scientific data in the store itself.

Sorry, I did not follow the engineering discussion around GWL.
Searching the web led me to [0]. That said, the question I am
asking is not answered there. In particular, the design paper
gives no rationale for that choice.

[0] http://lists.gnu.org/archive/html/guix-devel/2016-10/msg01248.html

> I think data should probably be stored and distributed out-of-band
> using appropriate storage mechanisms.

Then, in a follow-up mail, you reply to Konrad:

>> Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
> 
> [...]
> 
>> It would be nice if big datasets could conceptually be handled in the
>> same way while being stored elsewhere - a bit like git-annex does for
>> git. And for parallel computing, we could have special build daemons.
> 
> Exactly.  I think we need a git-annex/git-lfs-like tool for the store.
> (It could also be useful for things like secrets, which we don’t want
> to have in the store.)
> 
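To make the git-annex analogy concrete, here is a minimal Python sketch
of the general idea: a small pointer records the content hash of a
dataset, while the bytes themselves live in a separate blob directory.
This is not how Guix, git-annex or git-lfs are implemented; the function
names (`annex`, `resolve`), the pointer fields and the blob directory
layout are all hypothetical, chosen only for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large datasets are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def annex(data_file: Path, blob_dir: Path) -> dict:
    """Copy data_file into blob_dir under its hash and return a small pointer.

    The pointer (hypothetical format) is what would be tracked in place of
    the data itself; the blob directory stands in for out-of-band storage.
    """
    checksum = sha256_of(data_file)
    blob_dir.mkdir(parents=True, exist_ok=True)
    target = blob_dir / checksum
    if not target.exists():
        shutil.copyfile(data_file, target)
    return {
        "name": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }


def resolve(pointer: dict, blob_dir: Path) -> Path:
    """Return the blob path for a pointer, verifying its content hash first."""
    blob = blob_dir / pointer["sha256"]
    if sha256_of(blob) != pointer["sha256"]:
        raise ValueError("blob does not match pointer")
    return blob
```

The point of the sketch is only that the pointer is tiny and
reproducible, so it can be handled like any other input, while the
storage and transfer of the blob can use whatever mechanism suits large
data.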


-

"The most basic of all human needs is the need to understand and be
understood." Ralph Nichols

* Use guix to distribute data & reproducible (data) science
@ 2018-02-09 16:32 Amirouche Boubekki
From: Amirouche Boubekki @ 2018-02-09 16:32 UTC
  To: Guix Devel

Héllo all,

tl;dr: Distribution of data and software seems similar.
        Data is more and more important in software and reproducible
        science. The data science ecosystem lacks resource sharing.
        I think guix can help.

Recently I stumbled upon the open data movement and its links with
data science.

To give a high-level overview, there are several (web) platforms
that allow administrations and companies to publish data and
_distribute_ it. Examples of such platforms are data.gouv.fr [1] and
various other platforms based on CKAN [2].

[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the repository
is of rather poor quality, which makes it very difficult to use.

The other side of open data and data-based software is that some
software provides its own mechanism to _distribute_ data or binary
blobs called 'models', which are sometimes based on libre data.
Examples of such software are spacy [3], gensim [4], nltk [5] and
word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it's common knowledge that data wrangling,
a.k.a. cleaning and preparing data, is 80% of a data scientist's job.
It's required because data distributors don't do it right, since
they lack the manpower and the knowledge to do so.

To summarize:

1) Some software and platforms distribute _data_ themselves in some
    "closed garden" way. It's not the role of software to distribute
    data especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope that they do what
    they are supposed to do. How do you build the model? Is the model
    reproducible?

3) Preparing data has to be redone all the time; let's share resources
    and do it once.

It seems to me that guix has all the features required to handle data
and model distribution.

What do people think? Do we already use guix to distribute data and
models?

Also, it seems good to surf on the AI frenzy ;)



Thread overview: 19+ messages
2018-02-16 16:43 Use guix to distribute data & reproducible (data) science Amirouche Boubekki
2018-02-17 22:21 ` Roel Janssen
2018-02-18 23:42 ` Ludovic Courtès
2018-02-19  7:57 ` Ricardo Wurmus
  -- strict thread matches above, loose matches on Subject: below --
2018-02-09 16:32 Amirouche Boubekki
2018-02-09 17:13 ` Ludovic Courtès
2018-02-09 17:48   ` zimoun
2018-02-09 19:15   ` Konrad Hinsen
2018-02-09 23:01     ` zimoun
2018-02-09 23:17       ` Ricardo Wurmus
2018-02-12 11:46       ` Konrad Hinsen
2018-02-10  9:51     ` Amirouche Boubekki
2018-02-10 11:28       ` zimoun
2018-02-14 13:06     ` Ludovic Courtès
2018-02-15 17:10       ` zimoun
2018-02-16  9:28         ` Konrad Hinsen
2018-02-16 14:33           ` myglc2
2018-02-16 15:20             ` Konrad Hinsen
2018-02-16 12:41         ` Amirouche Boubekki
