From: Konrad Hinsen <konrad.hinsen@fastmail.net>
To: Guix Devel <guix-devel@gnu.org>
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Mon, 12 Feb 2018 12:46:44 +0100
Message-ID: <m1sha6ig63.fsf@fastmail.net>
In-Reply-To: <CAJ3okZ2rsr=NYa7U_Af33puNaMsXuCL2jSjh5UMnFUubGqEnaw@mail.gmail.com>

Hi everyone,

zimoun <zimon.toutoune@gmail.com> writes:

> From my point of view, there are 2 kinds of datasets:
>  a- the ones which are part of the software, e.g., used to pass the
> tests. Therefore, they are usually small, not always;
>  b- the ones which are applied to the software and somehow they are
> not in the source repository. They are big or not.

I was thinking of the second kind only.

> It does not appear to me a good idea to try to include in the store
> datasets of case b-.
> Is it not the job of data management tools, e.g., databases?

It depends. The frontier between data and code is not as clear as it may
seem. An example: the weights of a trained neural network can be seen as
data (a bunch of numbers), but also as code for a special-purpose
processor defined by the neural network.

Starting from that example, consider that the weights of a neural
network are not fundamentally different from fit parameters in other
scientific models, for example the positions of the atoms in a protein
structure. Using the same analogy as for the neural network, these
positions are the code for a special-purpose processor that computes
estimates of the Bragg reflections that are measured in protein
crystallography.
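
To make the analogy concrete, here is a minimal sketch of such a
"processor" in Python. It evaluates the textbook structure-factor sum
F(hkl) = sum_j f_j exp(2 pi i (h x_j + k y_j + l z_j)) and is heavily
simplified: the function name, the constant scattering factor of 1 per
atom, and the two-atom toy cell are assumptions made for illustration,
and thermal motion, occupancies, and symmetry are ignored.

```python
import cmath

def structure_factor(atoms, hkl):
    """Return the complex structure factor F(hkl) for a list of atoms.

    Each atom is (f, x, y, z): a scattering factor and fractional coordinates.
    """
    h, k, l = hkl
    return sum(
        f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
        for f, x, y, z in atoms
    )

# Toy "model": two identical atoms in a body-centred arrangement (hypothetical).
atoms = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.5, 0.5, 0.5)]

for hkl in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 0, 0)]:
    amplitude = abs(structure_factor(atoms, hkl))
    print(hkl, round(amplitude, 6))
```

The evaluator is fixed and the atom list is its only input, so changing
the positions changes the predicted amplitudes. In this toy cell the
reflections with h+k+l odd cancel out.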

On the other hand, any sufficiently small dataset can be replaced by a
piece of code that defines a literal data structure. Does that turn the
data into code or not?
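
A small sketch of what I mean, in Python. The sample names and values
are made up, and the helper name is hypothetical. The same dataset
appears once as text that has to be parsed and once as a literal in a
source file.

```python
import csv
import io

# The dataset as "data": text that a reader has to parse (values are made up).
CSV_TEXT = "sample,temperature_K\na,293.15\nb,310.0\n"

def load_from_csv(text):
    """Parse the CSV text into a list of (sample, temperature) tuples."""
    rows = csv.DictReader(io.StringIO(text))
    return [(row["sample"], float(row["temperature_K"])) for row in rows]

# The same dataset as "code": a literal data structure in a source file.
MEASUREMENTS = [("a", 293.15), ("b", 310.0)]

print(load_from_csv(CSV_TEXT) == MEASUREMENTS)
```

Both forms yield the same value, so any rule that treats one as data and
the other as code depends on the file format rather than on the content.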

My point is that trying to define a distinction between data and code is
always arbitrary and rarely of practical interest. I prefer to take a
pragmatic stance and ask the question: what are the advantages and
problems associated with managing some piece of data in the store? And I
suspect that exploring this question for a couple of applications will
lead to new ways to profit from the store.

However, distribution of data is an entirely different question from
managing data on a single machine. I have no idea how well suited the
store is for distributing data (Hydra does exactly that, right?), so
I'll happily listen to the Guix experts.

> It appears to me as a complement to these points (and personally, I
> learned some points about the design of GWL) with this thread:
> https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html

Thanks for the pointer!

>> It would be nice if big datasets could conceptually be handled in the same
>> way while being stored elsewhere - a bit like git-annex does for git. And
>> for parallel computing, we could have special build daemons.
>
> Hum? Is the point to add data management à la git-annex to GWL?

At least consider it - I don't know where that will lead.
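
One way to picture the idea is a small pointer record that travels with
the code while the bytes stay elsewhere. The Python sketch below is only
an illustration under my own assumptions: the class, the function names,
and the URL are hypothetical, and this is neither git-annex's actual key
format nor anything that GWL or Guix provides.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetPointer:
    """Small record kept alongside the code; the bytes live elsewhere."""
    sha256: str      # hex digest of the dataset contents
    size: int        # size in bytes
    location: str    # where the bytes can be fetched from (hypothetical URL)

def make_pointer(data: bytes, location: str) -> DatasetPointer:
    """Build the pointer record for a dataset stored at `location`."""
    return DatasetPointer(hashlib.sha256(data).hexdigest(), len(data), location)

def matches(pointer: DatasetPointer, data: bytes) -> bool:
    """Check that fetched bytes are the ones the pointer refers to."""
    return (len(data) == pointer.size
            and hashlib.sha256(data).hexdigest() == pointer.sha256)

payload = b"pretend this is several gigabytes of measurements"
pointer = make_pointer(payload, "https://data.example.org/run-42.bin")

print(matches(pointer, payload))
print(matches(pointer, payload + b"!"))
```

The pointer is small enough to be managed like any other file, and the
hash check lets the receiving side confirm it got the intended dataset
wherever the bytes came from.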


Amirouche Boubekki <amirouche.boubekki@gmail.com> writes:

>> For big datasets, some other mechanism is required.
>
> Big as in bigger than ram?

Bigger than one is willing to handle as files that are copied between
machines for distribution. For me, personally and at this moment in
time, that's somewhere between 1 GB and 10 GB.

> What I was thinking about is using guix to distribute data packages
> just like we distribute software from pypi. The advantage of using
> guix seems obvious, but apparently it's not desirable or possible and
> I don't understand why.

I think there are two questions:
 1. Can guix and its current infrastructure be used for data
 distribution?
 2. Can a yet-to-be-designed data distribution infrastructure
 use the guix store as its storage backend?

My understanding from Ludo's comment is that 1) may not be a good idea,
but that still leaves 2) as a topic for exploration.

>> And for parallel computing, we could have special build daemons.
>
> That's where GWL comes in?

Exactly.

Konrad.
