unofficial mirror of guix-devel@gnu.org 
From: zimoun <zimon.toutoune@gmail.com>
To: "Cook, Malcolm" <MEC@stowers.org>
Cc: guix-devel@gnu.org
Subject: Re: guix and mirroring dataset
Date: Thu, 27 May 2021 02:24:30 +0200
Message-ID: <86tumpnhvl.fsf@gmail.com>
In-Reply-To: DM6PR20MB3410FBC7EB2A4F230F19365CBE2D9@DM6PR20MB3410.namprd20.prod.outlook.com

Hi,

> Does the guix project and members suggest best guix-ish practices for
> managing on premise mirrors of large file-based data-sets such as
> appear in genomics HPC environments? 

From my understanding, this is still “unsolved” and there is no clear
answer.

Basically, the /gnu/store is not designed for managing large datasets, and
a piece is still missing.  We have already discussed this point on the
mailing list gwl-devel@gnu.org, although nothing concrete came out of it,
AFAIU.  We discussed it again recently; see the thread:

<https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/#>

Your input is welcome. :-)

> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
> that facilitates reproducible access to genomic
> data](https://www.nature.com/articles/s41467-021-22381-z) 

AFAIR, Ricardo pointed out GoGetData.  Personally, I have not yet looked
at the details.

> That would build on GWL?

From my understanding, something is missing between ’packages’,
’process’ and ’workflow’, for instance ’data’.  And speaking about
genomics, there are two kinds of large data:

 - fixed output (immutable?): think FASTA and FASTQ
 - computed output (mutable?): think BAM and indexes

and it is not clear how to deal with them.  And once that is answered, how
to share them (substitutes)?  Over HTTP, as everyone does, but we might also
want IPFS or something else that avoids the mirroring/sync issues.
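
To make the “fixed output” case concrete, here is a minimal sketch in
Python (not Guix code; the function names are hypothetical and only for
illustration).  It assumes the SHA-256 of a file such as a FASTA was
recorded in advance, and accepts a local copy only if its content matches
that hash, whichever mirror or transport it came from.  This is the same
idea as the hash check Guix applies to fixed-output downloads.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks so that
    multi-gigabyte datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, expected_sha256):
    """True when the file content matches the hash recorded in advance.
    (Hypothetical helper: the comparison ignores hex letter case.)"""
    return sha256_of(path) == expected_sha256.lower()
```

A check like this only covers the immutable kind of data.  Computed
outputs such as BAM files or indexes have no hash known in advance, which
is part of why they are harder to handle.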

> Use cases would be, e.g. download/sync selected (versions of) genomes
> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>
> I see much that addresses analysis workflows, such as
>  -  [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>  - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>  - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>
> Am I missing similar efforts toward maintaining an up-to-date catalog
> of the genomic resources that such workflows require? 

For now, some are maintained as packages, for instance:

  $ guix search "^r-" hg19 | recsel -C -P name
  r-phastcons100way-ucsc-hg19
  r-bsgenome-hsapiens-ucsc-hg19-masked
  r-txdb-hsapiens-ucsc-hg19-knowngene
  r-bsgenome-hsapiens-ucsc-hg19
  r-snplocs-hsapiens-dbsnp144-grch37
  r-illuminahumanmethylation450kanno-ilmn12-hg19
  r-fdb-infiniummethylation-hg19
  r-copyhelper

which are relatively small; for instance:

--8<---------------cut here---------------start------------->8---
r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
r-copyhelper total: 42.9 MiB
--8<---------------cut here---------------end--------------->8---


Hope that helps,
simon

