From: Simon Tournier <zimon.toutoune@gmail.com>
To: Nicolas Graves <ngraves@ngraves.fr>, guix-devel@gnu.org
Cc: Ricardo Wurmus <rekado@elephly.net>
Subject: how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?)
Date: Thu, 06 Apr 2023 20:55:55 +0200
Message-ID: <878rf4n8lg.fsf@gmail.com>
In-Reply-To: <87jzyshpyr.fsf@ngraves.fr>

Hi,

Well, we already discussed, in the GWL context, where to put “large”
data sets, without reaching a conclusion.  Having “large” data sets
inside the store is probably not a good idea.  But maybe these model
data are not “large” enough to worry about the store.


On Mon., 03 April 2023 at 18:48, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:

> In the case of nerd-dictation, the model parameters that can be used
> are listed here : https://alphacephei.com/vosk/models

Here, it is not that large…

--8<---------------cut here---------------start------------->8---
vosk-model-en-us-0.22 	           1.8G
[...]
vosk-model-en-us-0.42-gigaspeech   2.3G
[...]
vosk-model-ru-0.10 	           2.5G
--8<---------------cut here---------------end--------------->8---

…compared to some data packages we already have:

--8<---------------cut here---------------start------------->8---
$ for p in $(guix build -S $(guix package -A 'r\-' | grep genome | cut -f1)); do du -sh $p ;done | sort -hr | head -9
807M	/gnu/store/x2540idvd9pfmwz7ix04wm6ks58zwqkm-BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000.tar.gz
692M	/gnu/store/0vnlm5z2gkmzk2kkxzlab787kqjiw5g9-BSgenome.Hsapiens.UCSC.hg38_1.4.4.tar.gz
678M	/gnu/store/ngvghqhmjzscfxgzc1b9b4djws5rfzws-BSgenome.Hsapiens.UCSC.hg19_1.4.3.tar.gz
656M	/gnu/store/187smrknx3k5avhqapswrj40zh24h966-BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1.tar.gz
601M	/gnu/store/c15pc126x7k54yrqmbfwgg7gxkgbm9ip-BSgenome.Mmusculus.UCSC.mm10_1.4.0.tar.gz
598M	/gnu/store/cwsm9lqfmd1y9mwsx4sq4rzf45br6by2-BSgenome.Btaurus.UCSC.bosTau8_1.4.2.tar.gz
594M	/gnu/store/jky74snf2vr2r3s9c5131vacql6rna6a-BSgenome.Mmusculus.UCSC.mm9_1.4.0.tar.gz
374M	/gnu/store/zjzjag2zd408xnj5nq9ckfpcx22h7m4j-BSgenome.Drerio.UCSC.danRer11_1.4.2.tar.gz
37M	/gnu/store/abfk8jwhdd7d62jybfbvrgl682db7q2w-BSgenome.Dmelanogaster.UCSC.dm3_1.4.0.tar.gz
--8<---------------cut here---------------end--------------->8---

but still.  Well, I do not know whether a 2G data set belongs in the
store, but I do not have anything better to propose.


> One caveat is that using all these models can take a lot of space on the
> servers, a burden which is not useful because no build steps are really
> needed (except an unzip step). In this case, we can use the
> #:substitutable? #f flag. You can find an example of some of these
> packages right here :
> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm

That is what is already done for some packages in gnu/packages/bioconductor.scm:

https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioconductor.scm#n904
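
As a rough sketch only, this is what such a data-only package might look
like for one of the Vosk models.  The module name, package name, download
URL, install directory, and license below are assumptions made for
illustration, and the hash is a placeholder to be replaced with the real
one (e.g., from `guix download`).  The sketch also assumes that
copy-build-system honours #:substitutable? the same way gnu-build-system
does.

```scheme
;; Hypothetical module; not an existing file in Guix.
(define-module (machine-learning-models)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system copy)
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (gnu packages compression))

(define-public vosk-model-en-us
  (package
    (name "vosk-model-en-us")           ;hypothetical package name
    (version "0.22")
    (source
     (origin
       (method url-fetch)
       ;; Assumed URL layout, derived from the model listing page.
       (uri (string-append "https://alphacephei.com/vosk/models/"
                           "vosk-model-en-us-" version ".zip"))
       (sha256
        ;; Placeholder hash, NOT the real one.
        (base32
         "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     '(;; No real build step, so do not offer or fetch substitutes.
       #:substitutable? #f
       ;; Copy the unpacked model as-is; the target directory is an
       ;; arbitrary choice for this sketch.
       #:install-plan '(("." "share/vosk-models/en-us/"))))
    ;; The unpack phase needs unzip to extract the .zip source.
    (native-inputs (list unzip))
    (home-page "https://alphacephei.com/vosk/models")
    (synopsis "English model parameters for Vosk (sketch)")
    (description
     "Pre-trained English speech recognition model parameters for Vosk.")
    ;; Assumed license; check the upstream listing for each model.
    (license license:asl2.0)))
```

With #:substitutable? #f, the output is always produced locally from the
downloaded zip instead of being fetched as a substitute, so the only
large object that needs to be served is the source itself.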


> So my question is: Should we add this type of models in packages for
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use the tools)?

Well, gnu/packages/machine-learning-data.scm, or the same name with
“models” instead of “data”, sounds good to me.


Cheers,
simon

