Re: guix and mirroring dataset

unofficial mirror of guix-devel@gnu.org 

* Re: guix and mirroring dataset
@ 2021-05-27  0:24 zimoun
  2021-05-27  4:37 ` Cook, Malcolm
  0 siblings, 1 reply; 4+ messages in thread
From: zimoun @ 2021-05-27  0:24 UTC
  To: Cook, Malcolm; +Cc: guix-devel

Hi,

> Does the guix project and members suggest best guix-ish practices for
> managing on premise mirrors of large file-based data-sets such as
> appear in genomics HPC environments?

From my understanding, it is still “unsolved” and there is no clear
answer.

Basically, the /gnu/store is not designed for managing large datasets and
something is missing.  On the mailing list gwl-devel@gnu.org, we
have already discussed that point, although nothing came of it, AFAIU.
Recently, we discussed it again; see the thread:

<https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/#>

Your input is welcome. :-)

> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
> that facilitates reproducible access to genomic
> data](https://www.nature.com/articles/s41467-021-22381-z) 

AFAIR, Ricardo pointed out GoGetData.  Personally, I have not yet looked
at the details.

> That would build on GWL?

From my understanding, something is missing between ‘packages’,
‘process’ and ‘workflow’, for instance ‘data’.  And speaking about
genomics, there are two kinds of large data:

 - fixed output (immutable?): think FASTA and FASTQ
 - computed output (mutable?): think BAM and indexes

and it is not clear how to deal with them.  And once that is answered, how
to share them (substitutes)?  HTTP, as everyone does, but we might also
want IPFS or anything else that would avoid the mirroring/sync
issues.
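
To make the "fixed output" case concrete, here is a minimal shell sketch.
It is not an existing Guix or GWL feature; the function name
store_fixed_output and the DATA_STORE directory are hypothetical.  The idea
is that an immutable file can be pinned by its SHA-256 hash and stored under
a name derived from that hash, so it can be fetched from any mirror and
verified locally.

```shell
#!/bin/sh
# Hypothetical helper: copy FILE into a content-addressed directory
# after checking it against an expected SHA-256 hash.
# DATA_STORE is an assumed location, not something Guix defines.
DATA_STORE="${DATA_STORE:-./data-store}"

store_fixed_output() {
    file="$1"
    expected="$2"
    # sha256sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "hash mismatch for $file" >&2
        return 1
    fi
    mkdir -p "$DATA_STORE"
    # The stored name starts with the hash, so identical content
    # always lands at the same path regardless of where it came from.
    dest="$DATA_STORE/$actual-$(basename "$file")"
    cp "$file" "$dest"
    echo "$dest"
}
```

Computed outputs (BAM files, indexes) do not fit this scheme directly,
because their hash is only known after they have been built.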

> Use cases would be, e.g. download/sync selected (versions of) genomes
> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>
> I see much that addresses analysis workflows, such as
>  -  [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>  - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>  - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>
> Am I missing similar efforts toward maintaining an up-to-date catalog
> of the genomic resources that such workflows require? 

For now, some are maintained as packages, for instance:

  $ guix search "^r-" hg19 | recsel -C -P name
  r-phastcons100way-ucsc-hg19
  r-bsgenome-hsapiens-ucsc-hg19-masked
  r-txdb-hsapiens-ucsc-hg19-knowngene
  r-bsgenome-hsapiens-ucsc-hg19
  r-snplocs-hsapiens-dbsnp144-grch37
  r-illuminahumanmethylation450kanno-ilmn12-hg19
  r-fdb-infiniummethylation-hg19
  r-copyhelper

which are relatively small; for another instance:

--8<---------------cut here---------------start------------->8---
r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
r-copyhelper total: 42.9 MiB
--8<---------------cut here---------------end--------------->8---


Hope that helps,
simon



* RE: guix and mirroring dataset
  2021-05-27  0:24 guix and mirroring dataset zimoun
@ 2021-05-27  4:37 ` Cook, Malcolm
  2021-05-27 12:57   ` zimoun
  0 siblings, 1 reply; 4+ messages in thread
From: Cook, Malcolm @ 2021-05-27  4:37 UTC
  To: zimoun; +Cc: guix-devel@gnu.org

>> Does the guix project and members suggest best guix-ish practices for
>> managing on premise mirrors of large file-based data-sets such as
>> appear in genomics HPC environments?
>
>From my understanding, it is still “unsolved” and there is no clear
>answer.
>
>Basically, the /gnu/store is not designed for managing large datasets and
>something is missing. On the mailing list gwl-devel@gnu.org, we
>have already discussed that point, although nothing came of it, AFAIU.
>Recently, we discussed it again; see the thread:
>
><https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/>

Nice - I missed that thread. It brings up good considerations:
 - "immutable" vs. "mutable" resources
 - IPFS as possible means of distribution

>
>Your input is welcome. :-)

I was expecting to find workflows that have been developed for mirroring (downloading) genomic resources from sites such as Ensembl/NCBI/UCSC, etc., and then creating on-prem derived resources (e.g. blast indexes).

I currently tend to do this with GNU Make and shell scripting.

I was not expecting to find guix efforts toward maintaining such pre-computed derived datasets in an upstream repository of any sort, though that would be valuable to some.  Illumina, for instance, keeps (or used to keep?) selected genome indices for use with their software.  But that is not what I seek, and I think much of your remaining reply assumes it is.
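
For readers unfamiliar with this approach, the Make-plus-shell pattern boils down to "rebuild a derived resource only when its source is newer".  A minimal sketch follows; the function name rebuild_if_stale is hypothetical, and the indexing command is supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper mimicking Make's timestamp rule:
# run COMMAND... only if TARGET is missing or older than SOURCE.
rebuild_if_stale() {
    source_file="$1"
    target="$2"
    shift 2
    # -nt ("newer than") is not strictly POSIX but is supported by
    # the test builtin of bash, dash and busybox sh.
    if [ ! -e "$target" ] || [ "$source_file" -nt "$target" ]; then
        "$@"
    else
        echo "up to date: $target"
    fi
}

# Example use (samtools faidx writes genome.fa.fai next to genome.fa):
#   rebuild_if_stale genome.fa genome.fa.fai samtools faidx genome.fa
```

Timestamps are a weak notion of "changed"; comparing checksums would be closer to what the store does for packages.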

>> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
>> that facilitates reproducible access to genomic
>> data](https://www.nature.com/articles/s41467-021-22381-z) 
>
>AFAIR, Ricardo pointed out GoGetData. Personally, I have not yet looked
>at the details.

GoGetData does not seek to make upstream derived datasets available.  Rather, it is offered "as a fast, reproducible approach to installing standardized data recipes".  I assume GWL would be a good language to write such recipes, and that someone may already be doing so....

GoGetData recipes are just bash scripts organized in a particular folder structure in a GitHub repo.  They are expected to conform to a few conventions (e.g. variable names for genomes, species, etc.) and have a required YAML schema for their metadata.  They do not have any advanced workflow capabilities such as GWL might provide.

>> That would build on GWL?
>
>From my understanding, something is missing between ‘packages’,
>‘process’ and ‘workflow’, for instance ‘data’. And speaking about
>genomics, there are two kinds of large data:
>
>- fixed output (immutable?): think FASTA and FASTQ
>- computed output (mutable?): think BAM and indexes
>
>and it is not clear how to deal with them. And once that is answered, how
>to share them (substitutes)? HTTP, as everyone does, but we might also
>want IPFS or anything else that would avoid the mirroring/sync
>issues.
>
>> Use cases would be, e.g. download/sync selected (versions of) genomes
>> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
>> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>>
>> I see much that addresses analysis workflows, such as
>> - [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>> - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>> - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>>
>> Am I missing similar efforts toward maintaining an up-to-date catalog
>> of the genomic resources that such workflows require? 
>
>For now, some are maintained as packages, for instance:
>
>$ guix search "^r-" hg19 | recsel -C -P name
>r-phastcons100way-ucsc-hg19
>r-bsgenome-hsapiens-ucsc-hg19-masked
>r-txdb-hsapiens-ucsc-hg19-knowngene
>r-bsgenome-hsapiens-ucsc-hg19
>r-snplocs-hsapiens-dbsnp144-grch37
>r-illuminahumanmethylation450kanno-ilmn12-hg19
>r-fdb-infiniummethylation-hg19
>r-copyhelper

Yes, thanks, I see that guix has versions of BioConductor data packages.  These are an interesting use case.

>
>which are relatively small; for another instance:
>
>--8<---------------cut here---------------start------------->8---
>r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
>r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
>r-copyhelper total: 42.9 MiB
>--8<---------------cut here---------------end--------------->8---
>
>
>Hope that helps,
>simon

Thanks Simon, I'm pleased to have your thoughts and pointers on this topic...

~Malcolm


* RE: guix and mirroring dataset
  2021-05-27  4:37 ` Cook, Malcolm
@ 2021-05-27 12:57   ` zimoun
  0 siblings, 0 replies; 4+ messages in thread
From: zimoun @ 2021-05-27 12:57 UTC
  To: Cook, Malcolm; +Cc: guix-devel@gnu.org

Hi,

On Thu, 27 May 2021 at 04:37, "Cook, Malcolm" <MEC@stowers.org> wrote:

>>Your input is welcome. :-)
>
> I was expecting to find workflows that have been developed for
> mirroring (downloading) genomic resources from sites such as
> Ensembl/NCBI/UCSC, etc, and then creating on-prem derived resources
> (e.g. blast indexes).

Sorry, I missed what you were asking. :-)

The answer is: nothing ready-to-use.  Well, feel free to start a
discussion on gwl-devel@gnu.org.

> I currently tend to do this with GNU Make and shell scripting.

If you already have some shell scripts, I think it would not take much
effort to wrap them in some GWL glue. ;-)

And maybe it would be worth seeing whether a common organisation makes
sense here for the various cases.

Cheers,
simon


* guix and mirroring dataset
@ 2021-05-17 19:04 Cook, Malcolm
  0 siblings, 0 replies; 4+ messages in thread
From: Cook, Malcolm @ 2021-05-17 19:04 UTC
  To: guix-devel@gnu.org

Hi,

Do the guix project and its members suggest best guix-ish practices for managing on-premise mirrors of large file-based data-sets such as appear in genomics HPC environments?

Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework that facilitates reproducible access to genomic data](https://www.nature.com/articles/s41467-021-22381-z)

That would build on GWL?

Use cases would be, e.g. download/sync selected (versions of) genomes from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa, STAR, GMAP, HiSAT, IGV, BioConductor, etc...

I see much that addresses analysis workflows, such as
 -  [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
 - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
 - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)

Am I missing similar efforts toward maintaining an up-to-date catalog of the genomic resources that such workflows require?

Thanks!

Malcolm Cook
Database Applications Manager
Stowers Institute for Medical Research
Kansas City, MO  USA
