Re: Use guix to distribute data & reproducible (data) science

From: Roel Janssen <roel@gnu.org>
To: Amirouche Boubekki <amirouche@hypermove.net>
Cc: Guix Devel <guix-devel@gnu.org>, ludovic.courtes@inria.fr
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Sat, 17 Feb 2018 23:21:10 +0100
Message-ID: <874lmfmf55.fsf@gnu.org>
In-Reply-To: <24274adb01ba9c928a4701054b686a4a@hypermove.net>


Amirouche Boubekki writes:

> Hello again Ludovic,
>
> On 2018-02-09 18:13, ludovic.courtes@inria.fr wrote:
>> Hi!
>> 
>> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>> 
>>> tl;dr: Distribution of data and software seems similar.
>>>        Data is more and more important in software and reproducible
>>>        science. Data science ecosystem lacks resource sharing.
>>>        I think guix can help.
>> 
>> I think some of us especially Guix-HPC folks are convinced about the
>> usefulness of Guix as one of the tools in the reproducible science
>> toolchain (that was one of the themes of my FOSDEM talk).  :-)
>> 
>> Now, whether Guix is the right tool to distribute data, I don’t know.
>> Distributing large amounts of data is a job in itself, and the store
>> isn’t designed for that.  It could quickly become a bottleneck.
>
> What does it mean technically that the store “isn't designed for that”?
>
>> That’s one of the reasons why the Guix Workflow Language (GWL)
>> does not store scientific data in the store itself.
>
> Sorry, I did not follow the engineering discussion around GWL.
> Looking up the web brings me [0]. That said the question I am
> asking is not answered there. In particular there is no rationale
> for that in the design paper.
>
> [0] http://lists.gnu.org/archive/html/guix-devel/2016-10/msg01248.html
>
>> I think data should probably be stored and distributed out-of-band
>> using appropriate storage mechanisms.
>
> Then, in a follow up mail, you reply to Konrad:
>
>>> Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:
>> 
>> [...]
>> 
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>> 
>> Exactly.  I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)
>> 

To answer your question:
> What does it mean technically that the store “isn't designed for that”?

I speak only from my own experience with “big data sets”, so maybe it
is different for other people, but we use a separate storage system for
storing large amounts of data.  This separate storage is fault-tolerant
and optimized for large files: it accepts higher latency for file access
in exchange for a lower cost.

If we were to put data inside the store, we would need to optimize the
storage system for both low latency for small files, and a high storage
capacity.  This is extremely expensive.

Another issue I faced when providing datasets in the store is that
it's quite easy to end up with duplicated copies of the same dataset.

For example, I use the GNU build system for extracting a tarball that
contains a couple of files.  Whenever a package that the GNU build
system depends on changes, the data package will be rebuilt.

So you could use the trivial build system, but then I'd still need tar
and gzip to unpack the tarball.  Any change to these and the datasets
get duplicated.  This is not ideal.
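
To illustrate the duplication, here is a toy Python sketch.  It is not
how Guix actually computes store paths; it only assumes, as a
simplification, that the output path is derived from a hash over the
source and every build input.  The function name, the package name, the
placeholder source hash and the version numbers are all made up for the
example.

```python
import hashlib


def store_path(name, source_hash, build_inputs):
    """Toy model: the output path depends on the source AND on every
    build input, so changing any input yields a new path."""
    h = hashlib.sha256()
    h.update(source_hash.encode())
    # Sort the inputs so the result does not depend on dict ordering.
    for input_name, input_version in sorted(build_inputs.items()):
        h.update(f";{input_name}={input_version}".encode())
    return f"/gnu/store/{h.hexdigest()[:32]}-{name}"


# Placeholder standing in for the hash of the (unchanged) data tarball.
dataset = "hash-of-the-unchanged-tarball"

# Hypothetical versions of the tools needed to unpack the tarball.
before = store_path("my-dataset", dataset, {"tar": "1.29", "gzip": "1.8"})
after = store_path("my-dataset", dataset, {"tar": "1.30", "gzip": "1.8"})

print(before)
print(after)
# The data is byte-for-byte the same, yet it now lives under two paths.
print("duplicated:", before != after)
```

In this model the only way to avoid the second copy is to keep the
unpacking tools out of the hash, which is what storing the data outside
the store amounts to.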

Kind regards,
Roel Janssen
