From: zimoun
To: "Cook, Malcolm"
Cc: guix-devel@gnu.org
Subject: Re: guix and mirroring dataset
In-Reply-To: DM6PR20MB3410FBC7EB2A4F230F19365CBE2D9@DM6PR20MB3410.namprd20.prod.outlook.com
Date: Thu, 27 May 2021 02:24:30 +0200
Message-ID: <86tumpnhvl.fsf@gmail.com>
List-Id: "Development of GNU Guix and the GNU System distribution."

Hi,

> Does the guix project and members suggest best guix-ish practices for
> managing on premise mirrors of large file-based data-sets such as
> appear in genomics HPC environments?

From my understanding, it is still “unsolved” and there is no clear
answer.  Basically, the /gnu/store is not designed for managing large
datasets, and something is still missing.

On the mailing list gwl-devel@gnu.org, we have already discussed that
point, although nothing came out of it, AFAIU.  Recently, we discussed
it again, see the thread:

Your input is welcome. :-)

> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
> that facilitates reproducible access to genomic
> data](https://www.nature.com/articles/s41467-021-22381-z)

AFAIR, Ricardo pointed to this GoGetData.  Personally, I have not yet
looked at the details.

> That would build on GWL?

From my understanding, something is missing between ’packages’,
’process’ and ’workflow’, for instance ’data’.  And speaking about
genomics, there are two kinds of large data:

 - fixed output (immutable?): think FASTA and FASTQ
 - computed output (mutable?): think BAM and indexes

and it is not clear how to deal with them.  And once that is answered,
how to share them (substitutes)?  HTTP, as everyone does, but we could
also want IPFS or anything else that would avoid the mirroring/sync
issues.

> Use cases would be, e.g.
> download/sync selected (versions of) genomes
> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
> STAR, GMAP, HiSAT, IGV, BioConductor, etc...
>
> I see much that addresses analysis workflows, such as
> - [Reproducible genomics analysis pipelines with GNU Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
> - [Scalable Workflows and Reproducible Data Analysis for Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
> - [PiGx: reproducible genomics analysis pipelines with GNU Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>
> Am I missing similar efforts toward maintaining an up-to-date catalog
> of the genomic resources that such workflows require?

For now, some are maintained as packages, for instance:

  $ guix search "^r-" hg19 | recsel -C -P name
  r-phastcons100way-ucsc-hg19
  r-bsgenome-hsapiens-ucsc-hg19-masked
  r-txdb-hsapiens-ucsc-hg19-knowngene
  r-bsgenome-hsapiens-ucsc-hg19
  r-snplocs-hsapiens-dbsnp144-grch37
  r-illuminahumanmethylation450kanno-ilmn12-hg19
  r-fdb-infiniummethylation-hg19
  r-copyhelper

which are relatively small, for instance:

--8<---------------cut here---------------start------------->8---
r-txdb-hsapiens-ucsc-hg38-knowngene   total: 91.8 MiB
r-bsgenome-hsapiens-ucsc-hg38         total: 765.2 MiB
r-copyhelper                          total: 42.9 MiB
--8<---------------cut here---------------end--------------->8---

Hope that helps,
simon
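
As an illustration of the “fixed output” case mentioned above (data such
as FASTA or FASTQ files that should not change once published), a
mirrored file could be checked against a recorded content hash.  The
following is only a minimal Python sketch under stated assumptions: the
helper names `sha256_of` and `is_expected_fixed_output` and the toy file
are hypothetical, and nothing here is claimed to be how Guix or GWL
handle data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks so that
    large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_fixed_output(path, expected_sha256):
    """Return True when the mirrored file matches the hash recorded for it.

    Hypothetical helper: the recorded hash is assumed to come from some
    catalog kept next to the mirror."""
    return sha256_of(path) == expected_sha256.lower()


# Toy usage with a tiny stand-in for a FASTA file (assumed content).
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "toy.fa"
    content = b">seq1\nACGT\n"
    fasta.write_bytes(content)
    recorded = hashlib.sha256(content).hexdigest()
    print(is_expected_fixed_output(fasta, recorded))
```

The “computed output” case (BAM files, indexes) does not fit this check
as directly, since the result depends on the tools and inputs used to
produce it rather than on a hash known in advance.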