* Use guix to distribute data & reproducible (data) science @ 2018-02-09 16:32 Amirouche Boubekki 2018-02-09 17:13 ` Ludovic Courtès 0 siblings, 1 reply; 19+ messages in thread
From: Amirouche Boubekki @ 2018-02-09 16:32 UTC (permalink / raw)
To: Guix Devel

Héllo all,

tl;dr: Distribution of data and software seems similar.
Data is more and more important in software and reproducible
science. The data science ecosystem lacks resource sharing.
I think guix can help.

Recently I stumbled upon the open data movement and its links with data science. To give a high-level overview, there are several (web) platforms that allow administrations and companies to publish data and _distribute_ it. Examples of such platforms are data.gouv.fr [1] and various other platforms based on CKAN [2].

[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the repository is rather poor in terms of quality, making it very difficult to use.

The other side of this open data and data-based software is the fact that some software provides its own mechanism to _distribute_ data or binary blobs called 'models' that are sometimes based on libre data. Examples of such software are spacy [3], gensim [4], nltk [5] and word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it's common knowledge that data wrangling, i.e. cleaning and preparing data, is 80% of a data scientist's job. It's required because data distributors don't do it right, because they don't have the manpower and the knowledge to do it right.

To summarize:

1) Some software and platforms distribute _data_ themselves in some "closed garden" way. It's not the role of software to distribute data, especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope they do what they are supposed to do. How do you build the model? Is the model reproducible?
3) Preparing data must be re-done all the time; let's share resources and do it once.

It seems to me that guix has all the required features to handle data and model distribution. What do people think? Do we already use guix to distribute data and models?

Also, it seems good to surf on the AI frenzy ;)

^ permalink raw reply [flat|nested] 19+ messages in thread
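For what it is worth, a dataset could in principle be declared like any other package. A purely hypothetical sketch (the package name, URL, hash, and license are invented; it assumes the usual Guix package-module context and the copy-build-system that today's Guix offers for install-only packages):

```scheme
;; Hypothetical sketch only: name, URL, and hash below are
;; placeholders, not a real package.  Assumes modules such as
;; (guix packages), (guix download), (guix build-system copy)
;; and (guix licenses) with a license: prefix.
(define-public dataset-example-corpus
  (package
    (name "dataset-example-corpus")
    (version "1.0")
    (source (origin
              (method url-fetch)
              (uri "https://example.org/corpus-1.0.tar.gz")
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (synopsis "Example corpus distributed as a Guix package")
    (description "Placeholder dataset package illustrating the idea:
the data would land in the store, be hash-verified at fetch time, and
be shareable via substitutes like any other package output.")
    (home-page "https://example.org")
    (license license:cc0)))
```

Whether the store is the right place for such data is exactly what the rest of this thread debates.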
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 16:32 Use guix to distribute data & reproducible (data) science Amirouche Boubekki @ 2018-02-09 17:13 ` Ludovic Courtès 2018-02-09 17:48 ` zimoun 2018-02-09 19:15 ` Konrad Hinsen 0 siblings, 2 replies; 19+ messages in thread
From: Ludovic Courtès @ 2018-02-09 17:13 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel

Hi!

Amirouche Boubekki <amirouche@hypermove.net> skribis:

> tl;dr: Distribution of data and software seems similar.
> Data is more and more important in software and reproducible
> science. The data science ecosystem lacks resource sharing.
> I think guix can help.

I think some of us, especially Guix-HPC folks, are convinced of the usefulness of Guix as one of the tools in the reproducible science toolchain (that was one of the themes of my FOSDEM talk). :-)

Now, whether Guix is the right tool to distribute data, I don’t know. Distributing large amounts of data is a job in itself, and the store isn’t designed for that. It could quickly become a bottleneck. That’s one of the reasons why the Guix Workflow Language (GWL) does not store scientific data in the store itself.

I think data should probably be stored and distributed out-of-band using appropriate storage mechanisms.

Ludo’.
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 17:13 ` Ludovic Courtès @ 2018-02-09 17:48 ` zimoun 2018-02-09 19:15 ` Konrad Hinsen 1 sibling, 0 replies; 19+ messages in thread
From: zimoun @ 2018-02-09 17:48 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel, Amirouche Boubekki

Dear,

From my understanding, what you are describing is what bioinfo people call a workflow:
1- fetch data here and there
2- clean and prepare the data
3- compute stuff with these data
4- obtain an answer
and loop several times over several data sets.

The Guix Workflow Language allows one to implement the workflow, i.e., all the steps and the links between them that deal with the data. And thanks to Guix, reproducibility in terms of software comes almost for free. Moreover, if there is some channel mechanism, then there is a way to share these workflows.

I think the tools are there, modulo UI and corner cases. :-) From my point of view, workflows are missing because of manpower (Lispy syntax, etc.).

Last, a workflow is not necessarily reproducible bit-for-bit, since some algorithms use randomness.

Hope that helps.

All the best,
simon

On 9 February 2018 at 18:13, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> Hi!
>
> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>> Data is more and more important in software and reproducible
>> science. The data science ecosystem lacks resource sharing.
>> I think guix can help.
>
> I think some of us, especially Guix-HPC folks, are convinced of the
> usefulness of Guix as one of the tools in the reproducible science
> toolchain (that was one of the themes of my FOSDEM talk). :-)
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck. That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.
>
> I think data should probably be stored and distributed out-of-band using
> appropriate storage mechanisms.
>
> Ludo’.
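The four steps zimoun lists map naturally onto a chain of pure functions. A toy sketch (the dataset contents and the cleaning rule are invented for illustration):

```python
# Toy sketch of the fetch -> clean -> compute -> answer loop.
# The "datasets" and the cleaning rule are invented for illustration.

def fetch(dataset_id):
    # Stand-in for downloading data from an external source.
    datasets = {"run-1": [1, 2, None, 4], "run-2": [10, None, 30]}
    return datasets[dataset_id]

def clean(records):
    # Data wrangling: here, simply drop missing values.
    return [r for r in records if r is not None]

def compute(records):
    # The actual scientific computation, e.g. a mean.
    return sum(records) / len(records)

def run_workflow(dataset_id):
    # One pass of the workflow; real pipelines loop over many datasets.
    return compute(clean(fetch(dataset_id)))

results = {d: run_workflow(d) for d in ("run-1", "run-2")}
```

As zimoun notes, only the `compute` step (and any nondeterministic cleaning) threatens bit-for-bit reproducibility; the structure itself is just composition.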
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 17:13 ` Ludovic Courtès 2018-02-09 17:48 ` zimoun @ 2018-02-09 19:15 ` Konrad Hinsen 2018-02-09 23:01 ` zimoun ` (2 more replies) 1 sibling, 3 replies; 19+ messages in thread
From: Konrad Hinsen @ 2018-02-09 19:15 UTC (permalink / raw)
To: guix-devel

Hi,

On 09/02/2018 18:13, Ludovic Courtès wrote:

> Amirouche Boubekki <amirouche@hypermove.net> skribis:
>
>> tl;dr: Distribution of data and software seems similar.
>> Data is more and more important in software and reproducible
>> science. The data science ecosystem lacks resource sharing.
>> I think guix can help.
>
> Now, whether Guix is the right tool to distribute data, I don’t know.
> Distributing large amounts of data is a job in itself, and the store
> isn’t designed for that. It could quickly become a bottleneck. That’s
> one of the reasons why the Guix Workflow Language (GWL) does not store
> scientific data in the store itself.

I'd say it depends on the data and how it is used inside and outside of a workflow. Some data could very well be stored in the store, and then distributed via standard channels (Zenodo, ...) after export by "guix pack". For big datasets, some other mechanism is required.

I think it's worth thinking carefully about how to exploit guix for reproducible computations. As Lispers know very well, code is data and data is code. Building a package is a computation like any other. Scientific workflows could be handled by a specific build system. In fact, as long as no big datasets or multiple processors are involved, we can do this right now, using standard package declarations.

It would be nice if big datasets could conceptually be handled in the same way while being stored elsewhere - a bit like git-annex does for git. And for parallel computing, we could have special build daemons.

Konrad.
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 19:15 ` Konrad Hinsen @ 2018-02-09 23:01 ` zimoun 2018-02-09 23:17 ` Ricardo Wurmus 2018-02-12 11:46 ` Konrad Hinsen 2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki 2018-02-14 13:06 ` Ludovic Courtès 2 siblings, 2 replies; 19+ messages in thread
From: zimoun @ 2018-02-09 23:01 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel

Hi,

> I'd say it depends on the data and how it is used inside and outside of a
> workflow. Some data could very well be stored in the store, and then
> distributed via standard channels (Zenodo, ...) after export by "guix pack".
> For big datasets, some other mechanism is required.

I am not sure I understand the point. From my point of view, there are two kinds of datasets:
a- the ones which are part of the software, e.g., used to pass the tests. They are usually, though not always, small;
b- the ones which are applied to the software and are somehow not in the source repository. They may be big or not.

I do not know if some policy is established in guix about case a-, and I am not sure it is possible in fact (e.g., include a whole-genome FASTA file to test alignment tools? etc.).

It does not appear to me a good idea to try to include datasets of case b- in the store. Is that not the job of data management tools, e.g., databases etc.?

I do not know so much, but an idea would be to write a workflow: you fetch the data, you clean them, and you check by hashing that the result is the expected one. Only the software used to do that is in the store. The input and output data are not, but your workflow checks that they are the expected ones. However, it depends on what we call 'cleaning', because some algorithms are not deterministic.

Hum? I do not know if there is some mechanism in GWL to check the hash of the `data-inputs' field.

> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and data
> is code. Building a package is a computation like any other. Scientific
> workflows could be handled by a specific build system. In fact, as long as
> no big datasets or multiple processors are involved, we can do this right
> now, using standard package declarations.

This thread appears to me as a complement to these points ---and personally, I learned a few things about the design of GWL from it:

https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html

> It would be nice if big datasets could conceptually be handled in the same
> way while being stored elsewhere - a bit like git-annex does for git. And
> for parallel computing, we could have special build daemons.

Hum? So the point is to add data management à la git-annex to GWL, is it?

Have a nice week-end!

simon
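The check zimoun sketches here, data kept outside the store but verified against a recorded hash, fits in a few lines. A sketch with a throwaway file standing in for a downloaded dataset:

```python
import hashlib
import os
import tempfile

def verify_data(path, expected_sha256):
    """Check that a data file outside the store matches the recorded hash."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest == expected_sha256

# Demonstrate with a temporary file standing in for fetched data.
payload = b"sample,value\n1,2\n"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

ok = verify_data(path, hashlib.sha256(payload).hexdigest())   # matches
bad = verify_data(path, "0" * 64)                             # does not
os.remove(path)
```

The workflow would record `expected_sha256` alongside the recipe, exactly as a package definition records the hash of its source tarball.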
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 23:01 ` zimoun @ 2018-02-09 23:17 ` Ricardo Wurmus 2018-02-12 11:46 ` Konrad Hinsen 1 sibling, 0 replies; 19+ messages in thread
From: Ricardo Wurmus @ 2018-02-09 23:17 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel

zimoun <zimon.toutoune@gmail.com> writes:

> I do not know so much, but an idea would be to write a workflow: you
> fetch the data, you clean them, and you check by hashing that the
> result is the expected one. Only the software used to do that is in
> the store. The input and output data are not, but your workflow checks
> that they are the expected ones.
> However, it depends on what we call 'cleaning', because some
> algorithms are not deterministic.
>
> Hum? I do not know if there is some mechanism in GWL to check the hash
> of the `data-inputs' field.

In the GWL the data-inputs field is not special as far as any of the current execution engines are concerned. It’s up to the execution engine to implement recency checks or identity checks, as there is no one size that fits all inputs.

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 23:01 ` zimoun 2018-02-09 23:17 ` Ricardo Wurmus @ 2018-02-12 11:46 ` Konrad Hinsen 2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial 1 sibling, 1 reply; 19+ messages in thread
From: Konrad Hinsen @ 2018-02-12 11:46 UTC (permalink / raw)
To: Guix Devel

Hi everyone,

zimoun <zimon.toutoune@gmail.com> writes:

> From my point of view, there are two kinds of datasets:
> a- the ones which are part of the software, e.g., used to pass the
> tests. They are usually, though not always, small;
> b- the ones which are applied to the software and are somehow not
> in the source repository. They may be big or not.

I was thinking of the second kind only.

> It does not appear to me a good idea to try to include datasets of
> case b- in the store.
> Is that not the job of data management tools, e.g., databases etc.?

It depends. The frontier between data and code is not as clear as it may seem. An example: the weights of a trained neural network can be seen as data (a bunch of numbers), but also as code for a special-purpose processor defined by the neural network.

Starting from that example, consider that the weights of a neural network are not fundamentally different from fit parameters in other scientific models. For example, the positions of the atoms in a protein structure. Using the same analogy as for the neural network, these positions are the code for a special-purpose processor that computes estimations of the Bragg reflections that are measured in protein crystallography.

On the other hand, any sufficiently small dataset can be replaced by a piece of code that defines a literal data structure. Does that turn the data into code or not?

My point is that trying to define a distinction between data and code is always arbitrary and rarely of practical interest. I prefer to take a pragmatic stance and ask the question: what are the advantages and problems associated with managing some piece of data in the store? And I suspect that exploring this question for a couple of applications will lead to new ways to profit from the store.

However, distribution of data is an entirely different question from managing data on a single machine. I have no idea how well suited the store is for distributing data (Hydra does exactly that, right?), so I'll happily listen to the Guix experts.

> This thread appears to me as a complement to these points ---and
> personally, I learned a few things about the design of GWL from it:
> https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html

Thanks for the pointer!

>> It would be nice if big datasets could conceptually be handled in the same
>> way while being stored elsewhere - a bit like git-annex does for git. And
>> for parallel computing, we could have special build daemons.
>
> Hum? So the point is to add data management à la git-annex to GWL, is it?

At least consider it - I don't know where that will lead.

Amirouche Boubekki <amirouche.boubekki@gmail.com> writes:

>> For big datasets, some other mechanism is required.
>
> Big as in bigger than ram?

Bigger than one is willing to handle as files that are copied between machines for distribution. For me, personally and at this moment in time, that's somewhere between 1 GB and 10 GB.

> What I was thinking about is to use guix to distribute data packages
> just like we distribute software from PyPI. The advantage of using
> guix seems obvious, but apparently it's not desirable or possible and
> I don't understand why.

I think there are two questions:

1. Can guix and its current infrastructure be used for data distribution?
2. Can a yet-to-be-designed data distribution infrastructure use the guix store as its storage backend?

My understanding from Ludo's comment is that 1) may not be a good idea, but that still leaves 2) as a topic for exploration.

>> And for parallel computing, we could have special build daemons.
>
> That's where OWL comes in?

Exactly.

Konrad.
* Do you use packages in Guix to run neural networks? 2018-02-12 11:46 ` Konrad Hinsen @ 2018-02-14 4:43 ` Fis Trivial 2018-02-14 6:07 ` Pjotr Prins 2018-02-14 8:04 ` Konrad Hinsen 0 siblings, 2 replies; 19+ messages in thread
From: Fis Trivial @ 2018-02-14 4:43 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel

> It depends. The frontier between data and code is not as clear as it may
> seem. An example: the weights of a trained neural network can be seen as
> data (a bunch of numbers), but also as code for a special-purpose
> processor defined by the neural network.
>
> Starting from that example, consider that the weights of a neural
> network are not fundamentally different from fit parameters in other
> scientific models. For example, the positions of the atoms in a protein
> structure. Using the same analogy as for the neural network, these
> positions are the code for a special-purpose processor that computes
> estimations of the Bragg reflections that are measured in protein
> crystallography.

Sorry for bothering with a completely unrelated topic. I'm curious: do you train neural networks with packages in Guix? Or did you package the related libraries yourself?

I would love to let guix handle all machine learning libraries for me, but many of them, especially the neural network libraries, require a GPU to accelerate training. But, AFAIK, there is currently no GPU compute stack that meets GNU's standards. Being unable to put them upstream somehow makes me reluctant to package them.

Thanks.
* Re: Do you use packages in Guix to run neural networks? 2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial @ 2018-02-14 6:07 ` Pjotr Prins 2018-02-14 7:27 ` Fis Trivial 1 sibling, 1 reply; 19+ messages in thread
From: Pjotr Prins @ 2018-02-14 6:07 UTC (permalink / raw)
To: Fis Trivial; +Cc: Guix Devel

On Wed, Feb 14, 2018 at 04:43:50AM +0000, Fis Trivial wrote:
> Sorry for bothering with a completely unrelated topic.
> I'm curious: do you train neural networks with packages in Guix? Or did
> you package the related libraries yourself?
>
> I would love to let guix handle all machine learning libraries for me,
> but many of them, especially the neural network libraries, require a GPU
> to accelerate training. But, AFAIK, there is currently no GPU compute
> stack that meets GNU's standards. Being unable to put them upstream
> somehow makes me reluctant to package them.

Dennis did some work packaging opencl and arrayfire. It is in here:

https://gitlab.com/genenetwork/guix-bioinformatics/tree/master/gn/packages

and may be a reasonable starting point.

For the proprietary stuff, it won't make it into trunk, but you can still package it and share.

Pj.

--
* Re: Do you use packages in Guix to run neural networks? 2018-02-14 6:07 ` Pjotr Prins @ 2018-02-14 7:27 ` Fis Trivial 0 siblings, 0 replies; 19+ messages in thread
From: Fis Trivial @ 2018-02-14 7:27 UTC (permalink / raw)
To: Pjotr Prins; +Cc: Guix Devel

> Dennis did some work packaging opencl and arrayfire. It is in here:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/tree/master/gn/packages
>
> and may be a reasonable starting point.
>
> For the proprietary stuff, it won't make it into trunk, but you can
> still package it and share.
>
> Pj.

Thanks. It will take some time to work things out, but this is a starting point.
* Re: Do you use packages in Guix to run neural networks? 2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial 2018-02-14 6:07 ` Pjotr Prins @ 2018-02-14 8:04 ` Konrad Hinsen 1 sibling, 0 replies; 19+ messages in thread
From: Konrad Hinsen @ 2018-02-14 8:04 UTC (permalink / raw)
To: Fis Trivial, Guix Devel

On 14/02/2018 05:43, Fis Trivial wrote:

> Sorry for bothering with a completely unrelated topic.
> I'm curious: do you train neural networks with packages in Guix? Or did
> you package the related libraries yourself?

My needs are modest, all I use is the multilayer perceptron from scikit-learn, which is already packaged in Guix.

Konrad.
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 19:15 ` Konrad Hinsen @ 2018-02-10 9:51 ` Amirouche Boubekki 2018-02-10 11:28 ` zimoun 2 siblings, 1 reply; 19+ messages in thread
From: Amirouche Boubekki @ 2018-02-10 9:51 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: guix-devel

On Fri, Feb 9, 2018 at 8:16 PM Konrad Hinsen <konrad.hinsen@fastmail.net> wrote:

> Hi,
>
> On 09/02/2018 18:13, Ludovic Courtès wrote:
>
> > Amirouche Boubekki <amirouche@hypermove.net> skribis:
> >
> >> tl;dr: Distribution of data and software seems similar.
> >> Data is more and more important in software and reproducible
> >> science. The data science ecosystem lacks resource sharing.
> >> I think guix can help.
> >
> > Now, whether Guix is the right tool to distribute data, I don’t know.
> > Distributing large amounts of data is a job in itself, and the store
> > isn’t designed for that. It could quickly become a bottleneck. That’s
> > one of the reasons why the Guix Workflow Language (GWL) does not store
> > scientific data in the store itself.
>
> and then distributed via standard channels (Zenodo, ...)

Thanks for the pointer!

> For big datasets, some other mechanism is required.

Big as in bigger than ram?

> I think it's worth thinking carefully about how to exploit guix for
> reproducible computations. As Lispers know very well, code is data and
> data is code. Building a package is a computation like any other.

What I was thinking about is to use guix to distribute data packages just like we distribute software from PyPI. The advantage of using guix seems obvious, but apparently it's not desirable or possible and I don't understand why.

> Scientific workflows could be handled by a specific build system. In
> fact, as long as no big datasets or multiple processors are involved, we
> can do this right now, using standard package declarations.

Ok, good to know.

> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git.

Thanks again for the pointer.

> And for parallel computing, we could have special build daemons.

That's where OWL comes in?
* Re: Use guix to distribute data & reproducible (data) science 2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki @ 2018-02-10 11:28 ` zimoun 0 siblings, 0 replies; 19+ messages in thread
From: zimoun @ 2018-02-10 11:28 UTC (permalink / raw)
To: Amirouche Boubekki; +Cc: Guix Devel

Hi,

Thank you for the topic, which feeds my thoughts. And thank you Ricardo for your explanations.

> What I was thinking about is to use guix to distribute data packages just
> like we distribute software from PyPI. The advantage of using guix seems
> obvious, but apparently it's not desirable or possible and I don't
> understand why.

Are you talking about packaging a way to fetch the data? The first Debian example I found:

https://packages.debian.org/fr/stretch/astrometry-data-2mass-00

Or about packaging the dataset itself? That does not seem affordable in terms of resources (bandwidth + disk), does it?

Last, when considering large datasets --say samples of hundreds of GB or more-- hashing becomes the bottleneck.

All the best,
simon
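On the hashing bottleneck zimoun mentions: reading in fixed-size chunks at least keeps memory flat even for very large files, though CPU time still grows linearly with the data. A minimal sketch:

```python
import hashlib
import os
import tempfile

def hash_file_streaming(path, chunk_size=1 << 20):
    """Hash a file of arbitrary size in constant memory (1 MiB chunks)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Sanity check against hashing the same bytes in one go.
data = b"x" * (3 * 1024 * 1024)  # 3 MiB, forces several chunks
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)
streamed = hash_file_streaming(path)
os.remove(path)
```

For hundreds-of-GB datasets the wall-clock cost remains real, which is one argument for hashing once at ingest and recording the digest, rather than re-hashing on every use.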
* Re: Use guix to distribute data & reproducible (data) science 2018-02-09 19:15 ` Konrad Hinsen 2018-02-09 23:01 ` zimoun 2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki @ 2018-02-14 13:06 ` Ludovic Courtès 2018-02-15 17:10 ` zimoun 2 siblings, 1 reply; 19+ messages in thread
From: Ludovic Courtès @ 2018-02-14 13:06 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: guix-devel

Hello,

Konrad Hinsen <konrad.hinsen@fastmail.net> skribis:

> It would be nice if big datasets could conceptually be handled in the
> same way while being stored elsewhere - a bit like git-annex does for
> git. And for parallel computing, we could have special build daemons.

Exactly. I think we need a git-annex/git-lfs-like tool for the store. (It could also be useful for things like secrets, which we don’t want to have in the store.)

Ludo’.
* Re: Use guix to distribute data & reproducible (data) science 2018-02-14 13:06 ` Ludovic Courtès @ 2018-02-15 17:10 ` zimoun 2018-02-16 9:28 ` Konrad Hinsen 2018-02-16 12:41 ` Amirouche Boubekki 0 siblings, 2 replies; 19+ messages in thread
From: zimoun @ 2018-02-15 17:10 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel

Hi,

Thank you for this food for thought.

I agree that the frontier between code and data is arbitrary. However, I am not sure I get the picture about data management in the context of Reproducible Science. What is the issue? So, I take up your invitation to explore the idea. :-)

Let's think about the old lab experiment. On one hand, you have your protocol and the description of all the steps. On the other hand, you have measurements and results. Then, I am able to imagine some kind of bit-to-bit mechanism for the protocol part. I am not sure about the measurements part. Well, the protocol is code or workflow; the measurements are data. And I agree that, e.g., information about electronic orbits or the weights of a trained neural network is sometimes part of the protocol. :-)

For me, just talking about code, it is not a straightforward task to define the properties of a reproducible and fully controlled computational environment. It is --I guess-- what Guix is defining (transactional, user-profile, hackable, etc.). Then, it appears to me even more difficult for data.

What are such properties for data management? In other words, on paper, what are the benefits of managing some piece of data in the store? For example for the applications of the weights of a trained neural network, or of the positions of the atoms in a protein structure.

For me --maybe I am wrong-- the way is to define a package (or workflow) that fetches the data from some external source, cleans it if needed, does some checks, and then puts it in /path/to/somewhere/ outside the store. In parallel computing, this /path/to/somewhere/ is accessible by all the nodes. Moreover, this /path/to/somewhere/ contains something hash-based in the folder name.

Is it not enough? Why do you need the history of changes, as git provides?

Secrets are another story than the reproducible science toolchain, I guess.

Thank you again.

All the best,
simon
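The "/path/to/somewhere/ with something hash-based in the folder name" that zimoun describes is essentially content addressing, the same naming scheme the store itself uses. A toy sketch (the file name and payload are invented):

```python
import hashlib
import os
import tempfile

def content_addressed_dir(root, name, payload):
    """Store payload under root/<hash>-<name>, mirroring store-style naming."""
    digest = hashlib.sha256(payload).hexdigest()[:32]
    path = os.path.join(root, f"{digest}-{name}")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, name), "wb") as f:
        f.write(payload)
    return path

root = tempfile.mkdtemp()
# Identical content yields the identical path; changed content a new one.
p1 = content_addressed_dir(root, "genome.fa", b">seq1\nACGT\n")
p2 = content_addressed_dir(root, "genome.fa", b">seq1\nACGT\n")
p3 = content_addressed_dir(root, "genome.fa", b">seq1\nTGCA\n")
```

Because the path encodes the content, every node in a cluster can check it has the right version just by looking at the directory name.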
* Re: Use guix to distribute data & reproducible (data) science 2018-02-15 17:10 ` zimoun @ 2018-02-16 9:28 ` Konrad Hinsen 2018-02-16 14:33 ` myglc2 1 sibling, 1 reply; 19+ messages in thread
From: Konrad Hinsen @ 2018-02-16 9:28 UTC (permalink / raw)
To: zimoun, Guix Devel

Hi,

> In other words, on paper, what are the benefits of managing some piece
> of data in the store? For example for the applications of the weights
> of a trained neural network, or of the positions of the atoms in a
> protein structure.

Provenance tracking. In a complex data processing workflow, it is important to know which computations were done in which order using which software. This is technically almost the same as software dependency tracking, so it would be nice to re-use the Guix infrastructure for this.

> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it if
> needed, does some checks, and then puts it in /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/ contains
> something hash-based in the folder name.
>
> Is it not enough?

Whether for software or for data, dependencies are DAGs whose terminal nodes are measurements (for data) or human-supplied information (code, parameters, methodological choices). Guix handles the latter very well. The three missing pieces are:

 - Dealing with measurements, which might involve interacting with
   experimental equipment or databases. Moreover, since data from
   such sources can change, its hash in the store must be computed
   from the contents, not just from the reference to the contents.

 - Dealing with data that is much larger than what Guix handles well
   in the store.

 - Exporting results with provenance tracking to the outside world,
   which may not be using Guix. Big data aside, this could take the
   form of a new output format for "guix pack", for example a research
   object (http://www.researchobject.org/) or a Reproducible Document
   Archive (https://github.com/substance/dar).

> Why do you need the history of changes, as git provides?

I'd say git is fine for everything not "too big".

> Secrets are another story than the reproducible science toolchain, I guess.

Yes, indeed. And not something I need to deal with, so I will shut up now!

Konrad.
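A minimal form of the provenance tracking Konrad describes, recording for each result the content hashes of its inputs and the software used, might look like this (all names and versions are invented for illustration):

```python
import hashlib
import json

def provenance_record(step, input_blobs, software):
    """Record which inputs (by content hash) and which tools made a result."""
    return {
        "step": step,
        "inputs": {name: hashlib.sha256(blob).hexdigest()
                   for name, blob in input_blobs.items()},
        "software": software,  # illustrative: tool name -> version
    }

rec = provenance_record(
    "align-reads",
    {"reads.fq": b"@r1\nACGT\n+\nIIII\n"},
    {"aligner": "example-0.1"},
)
serialized = json.dumps(rec, sort_keys=True)  # shippable alongside the result
```

Chaining such records, with each step's output hash becoming an input hash of the next, yields exactly the DAG whose terminal nodes are measurements; the Guix derivation graph records the software half of it already.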
* Re: Use guix to distribute data & reproducible (data) science 2018-02-16 9:28 ` Konrad Hinsen @ 2018-02-16 14:33 ` myglc2 2018-02-16 15:20 ` Konrad Hinsen 0 siblings, 1 reply; 19+ messages in thread
From: myglc2 @ 2018-02-16 14:33 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Guix Devel

Hi Konrad,

On 02/16/2018 at 10:28 Konrad Hinsen writes:

> Whether for software or for data, dependencies are DAGs whose terminal
> nodes are measurements (for data) or human-supplied information (code,
> parameters, methodological choices). Guix handles the latter very well.
>
> The three missing pieces are:
>
> - Dealing with measurements, which might involve interacting with
>   experimental equipment or databases. Moreover, since data from
>   such sources can change, its hash in the store must be computed
>   from the contents, not just from the reference to the contents.

Why not "enclose" a measurement set and its provenance in a git "package"?

- George
* Re: Use guix to distribute data & reproducible (data) science 2018-02-16 14:33 ` myglc2 @ 2018-02-16 15:20 ` Konrad Hinsen 0 siblings, 0 replies; 19+ messages in thread
From: Konrad Hinsen @ 2018-02-16 15:20 UTC (permalink / raw)
To: myglc2, Guix Devel

Hi George,

myglc2@gmail.com writes:

>> The three missing pieces are:
>>
>> - Dealing with measurements, which might involve interacting with
>>   experimental equipment or databases. Moreover, since data from
>>   such sources can change, its hash in the store must be computed
>>   from the contents, not just from the reference to the contents.
>
> Why not "enclose" a measurement set and its provenance in a git
> "package"?

For text-based data of reasonable size, that's an option. Many people are already using git for data management. But then, data is so diverse that no rule and no tool will satisfy everybody's requirements.

Konrad.
* Re: Use guix to distribute data & reproducible (data) science
  2018-02-15 17:10 ` zimoun
  2018-02-16  9:28 ` Konrad Hinsen
@ 2018-02-16 12:41 ` Amirouche Boubekki
  1 sibling, 0 replies; 19+ messages in thread
From: Amirouche Boubekki @ 2018-02-16 12:41 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel

On Thu, Feb 15, 2018 at 6:11 PM zimoun <zimon.toutoune@gmail.com> wrote:

> Hi,
>
> Thank you for this food for thought.
>
> I agree that the frontier between code and data is arbitrary.
>
> However, I am not sure I get the picture about data management in the
> context of Reproducible Science. What is the issue?
>
> So, I take up your invitation to explore your idea. :-)
> [...]
> For me, just talking about code, it is not a straightforward task to
> define the properties of a reproducible and fully controlled
> computational environment. It is --I guess-- what Guix is defining
> (transactional, user-profile, hackable, etc.). Then, it appears to me
> even more difficult for data.
>
> What are such properties for data management?
> In other words, on paper, what are the benefits of managing some piece
> of data in the store? For example, the weights of a trained neural
> network, or the positions of the atoms in a protein structure.

Given versioned datasets, you could want to switch the input dataset of
a given "pipeline" to see how different data produce different results.
Also, it is desirable to be able to re-start a "pipeline" when a dataset
is updated.

> For me --maybe I am wrong-- the way is to define a package (or
> workflow) that fetches the data from some external source, cleans it
> if needed, does some checks, and then puts it in /path/to/somewhere/
> outside the store. In parallel computing, this /path/to/somewhere/ is
> accessible by all the nodes. Moreover, this /path/to/somewhere/
> contains something hash-based in the folder name.
> Is it not enough?

It is not enough: you might need to make a diff between two datasets,
which is not easily done if the data is stored in tarballs. But that is
not a case that can be handled by guix.

> Why do you need the history of changes? As git provides?

Because, if the dataset introduces a change that is not handled by the
rest of the code, you can find out by looking at the diff. For instance,
a column that is an enumeration of three values now has a fourth. But
again, it's not a case that's meant to be handled by guix.

Like others have said, there are different kinds of data and - even if
it were possible to handle large datasets in the guix store - it would
also require a lot of space and computing power. ConceptNet 5.5.5 is
10G, takes more than a dozen hours to build, and AFAIK is not
reproducible since it takes its input directly from a live instance of
Wiktionary. WikiData is 100G, but requires no processing power. Those
are structured data that you could want to version in something like
git. But things like spacy models <https://spacy.io/models/en>, which
are around 1G and take I guess a few hours to build, are not structured.
Those are the data that I know about, and there are very few of them
compared to small datasets (see data.gouv.fr).

I think there are various opportunities around reproducible data
science. In particular, I see two main ones:

a) Packaging data and working with upstream to keep the data clean - a
   task already handled by private actors and by initiatives like
   http://datahub.io/

b) Cooperating around the making of datasets, some kind of *git for
   data*, for which there are few or no initiatives. FWIW, I started
   such a project, which I call neon
   <http://www.hyperdev.fr/projects/neon/>.

Thanks for the feedback.
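The package-or-workflow zimoun describes - fetch, clean, check, then
install under a hash-based folder name outside the store - can be
sketched as a short script. This is a minimal illustration, not a Guix
package definition: `install_dataset` is a hypothetical name, the
`clean` and `checks` callables stand in for whatever wrangling a real
dataset needs, and `/path/to/somewhere` is the placeholder path from the
quoted message.

```python
import hashlib
import os
import urllib.request

def install_dataset(url, clean, checks, root="/path/to/somewhere"):
    """Fetch a dataset, clean it, run sanity checks, then install it
    under a folder whose name embeds a hash of the *cleaned* contents,
    so every compute node resolves the same version to the same path."""
    raw = urllib.request.urlopen(url).read()        # fetch
    cleaned = clean(raw)                            # clean / prepare
    for check in checks:                            # sanity checks
        if not check(cleaned):
            raise ValueError(f"check failed: {check.__name__}")
    digest = hashlib.sha256(cleaned).hexdigest()[:16]
    target = os.path.join(root, f"{digest}-dataset")
    os.makedirs(target, exist_ok=True)
    with open(os.path.join(target, "data"), "wb") as f:
        f.write(cleaned)
    return target
```

Because the folder name is derived from the cleaned bytes, re-running
the pipeline on unchanged upstream data is a no-op, while an upstream
update lands in a fresh folder next to the old one.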
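The concrete failure mode above - an enumeration column of three values
that suddenly has a fourth - is the kind of semantic diff that tarballs
hide but that is easy to compute once the data is text. A minimal
sketch, assuming CSV input; `new_categories` is a made-up helper, not
part of any existing tool:

```python
import csv
import io

def new_categories(old_csv, new_csv, column):
    """Report values present in a column of the new dataset version
    but absent from the old one - e.g. a fourth value appearing in a
    column that used to be a three-value enumeration."""
    def values(text):
        return {row[column] for row in csv.DictReader(io.StringIO(text))}
    return values(new_csv) - values(old_csv)
```

A "git for data" in the sense of the neon project would presumably offer
this kind of column-aware diff natively, instead of a line-based one.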
end of thread, other threads: [~2018-02-16 15:21 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-09 16:32 Use guix to distribute data & reproducible (data) science Amirouche Boubekki
2018-02-09 17:13 ` Ludovic Courtès
2018-02-09 17:48 ` zimoun
2018-02-09 19:15 ` Konrad Hinsen
2018-02-09 23:01 ` zimoun
2018-02-09 23:17 ` Ricardo Wurmus
2018-02-12 11:46 ` Konrad Hinsen
2018-02-14 4:43 ` Do you use packages in Guix to run neural networks? Fis Trivial
2018-02-14 6:07 ` Pjotr Prins
2018-02-14 7:27 ` Fis Trivial
2018-02-14 8:04 ` Konrad Hinsen
2018-02-10 9:51 ` Use guix to distribute data & reproducible (data) science Amirouche Boubekki
2018-02-10 11:28 ` zimoun
2018-02-14 13:06 ` Ludovic Courtès
2018-02-15 17:10 ` zimoun
2018-02-16 9:28 ` Konrad Hinsen
2018-02-16 14:33 ` myglc2
2018-02-16 15:20 ` Konrad Hinsen
2018-02-16 12:41 ` Amirouche Boubekki