unofficial mirror of guix-devel@gnu.org 
* Re: guix and mirroring dataset
From: zimoun @ 2021-05-27  0:24 UTC
  To: Cook, Malcolm; +Cc: guix-devel

Hi,

> Does the guix project and members suggest best guix-ish practices for
> managing on premise mirrors of large file-based data-sets such as
> appear in genomics HPC environments?

From my understanding, it is still “unsolved” and there is no clear
answer.

Basically, the /gnu/store is not designed for managing large datasets,
and something is still missing.  On the mailing list gwl-devel@gnu.org,
we have already discussed that point, although nothing concrete came of
it, AFAIU.  Recently we discussed it again; see the thread:

<https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/#>

Your input is welcome. :-)

> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
> that facilitates reproducible access to genomic
> data](https://www.nature.com/articles/s41467-021-22381-z) 

AFAIR, Ricardo pointed out GoGetData.  Personally, I have not yet looked
at the details.

> That would build on GWL?

From my understanding, something is missing between ‘packages’,
‘process’ and ‘workflow’, for instance ‘data’.  And speaking about
genomics, there are two kinds of large data:

 - fixed output (immutable?): think FASTA and FASTQ
 - computed output (mutable?): think BAM and indexes

and it is not clear how to deal with them.  And once that is answered,
how to share them (substitutes)?  HTTP, as everyone does, but we might
also want IPFS or something else that would avoid the mirroring/sync
issues.
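
To make the “fixed output” case a bit more concrete: one possible
reading is data whose content hash is known in advance, so any mirror
can be checked against a pinned digest.  Below is a minimal sketch
under that assumption, using plain coreutils rather than Guix itself.
The helper name `verify_fixed_output` and the tiny demo file are
hypothetical and only for illustration; this is not how Guix or GWL
handle data today.

```shell
# Hypothetical helper: succeed only if FILE matches the pinned SHA-256 digest.
verify_fixed_output() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | cut -d ' ' -f 1)
  [ "$actual" = "$expected" ]
}

# Demo on a tiny stand-in FASTA file (not real genomic data).
demo=$(mktemp)
printf '>chr_demo\nACGT\n' > "$demo"

# Assumption: in practice the digest would be recorded once and kept
# alongside the dataset description; here it is computed on the spot.
pinned=$(sha256sum "$demo" | cut -d ' ' -f 1)

if verify_fixed_output "$demo" "$pinned"; then
  echo "fixed output verified"
fi
rm -f "$demo"
```

Computed outputs such as BAM files and indexes cannot be pinned this
way ahead of time, since their content depends on the inputs and on the
tool versions used, which is part of why the two kinds seem to need
different treatment.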

> Use cases would be, e.g. download/sync selected (versions of) genomes
> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>
> I see much that addresses analysis workflows, such as
>  -  [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>  - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>  - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>
> Am I missing similar efforts toward maintaining an up-to-date catalog
> of the genomic resources that such workflows require? 

For now, some are maintained as packages, for instance:

  $ guix search "^r-" hg19 | recsel -C -P name
  r-phastcons100way-ucsc-hg19
  r-bsgenome-hsapiens-ucsc-hg19-masked
  r-txdb-hsapiens-ucsc-hg19-knowngene
  r-bsgenome-hsapiens-ucsc-hg19
  r-snplocs-hsapiens-dbsnp144-grch37
  r-illuminahumanmethylation450kanno-ilmn12-hg19
  r-fdb-infiniummethylation-hg19
  r-copyhelper

which are relatively small; for instance:

--8<---------------cut here---------------start------------->8---
r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
r-copyhelper total: 42.9 MiB
--8<---------------cut here---------------end--------------->8---
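
The command that produced the listing above is not shown.  Assuming it
came from the final “total:” line that `guix size` prints for each
package, a sketch like the following would give a similar summary.  The
function name `closure_totals` is hypothetical, and the exact figures
depend on the Guix revision in use.

```shell
# Print "PACKAGE total: SIZE" for each package given as an argument.
# Assumption: the last line of `guix size PACKAGE` is its "total: ..." line.
closure_totals() {
  for pkg in "$@"; do
    printf '%s %s\n' "$pkg" "$(guix size "$pkg" | tail -n 1)"
  done
}

# Run the summary only where guix is actually installed.
if command -v guix >/dev/null 2>&1; then
  closure_totals r-txdb-hsapiens-ucsc-hg38-knowngene \
                 r-bsgenome-hsapiens-ucsc-hg38 \
                 r-copyhelper
fi
```

Note that `guix size` may need to download or build the packages in
order to measure them.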


Hope that helps,
simon


* guix and mirroring dataset
From: Cook, Malcolm @ 2021-05-17 19:04 UTC
  To: guix-devel@gnu.org

Hi,

Does the guix project and members suggest best guix-ish practices for managing on premise mirrors of large file-based data-sets such as appear in genomics HPC environments?

Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework that facilitates reproducible access to genomic data](https://www.nature.com/articles/s41467-021-22381-z)

That would build on GWL?

Use cases would be, e.g. download/sync selected (versions of) genomes from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa, STAR, GMAP, HiSAT, IGV, BioConductor, etc...

I see much that addresses analysis workflows, such as
 -  [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
 - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
 - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)

Am I missing similar efforts toward maintaining an up-to-date catalog of the genomic resources that such workflows require?

Thanks!

Malcolm Cook
Database Applications Manager
Stowers Institute for Medical Research
Kansas City, MO  USA


