* Where should we put machine learning model parameters ?
@ 2023-04-03 16:48 Nicolas Graves via Development of GNU Guix and the GNU System distribution.
  2023-04-03 19:12 ` Kyle
  2023-04-06 18:55 ` how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?) Simon Tournier
  0 siblings, 2 replies; 3+ messages in thread
From: Nicolas Graves via Development of GNU Guix and the GNU System distribution. @ 2023-04-03 16:48 UTC (permalink / raw)
  To: guix-devel


Hi Guix!

I've recently contributed a few tools that make some OSS machine
learning programs usable on Guix, namely nerd-dictation for dictation
and llama-cpp as a conversational bot.

In the first case, I would also like to contribute the parameters of some
localized models so that they can be used more easily through Guix. I
already raised this question when submitting these patches, without
getting a clear answer.

In the case of nerd-dictation, the model parameters that can be used
are listed here: https://alphacephei.com/vosk/models

One caveat is that hosting substitutes for all these models would take a lot
of space on the servers, a burden which is not useful because no real build
steps are needed (apart from unzipping). In this case, we can use the
#:substitutable? #f flag. You can find an example of such packages here:
https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm
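
For illustration, here is a rough sketch of what such a package could look
like (the name, version, hash, install path and license below are
placeholders, and I assume #:substitutable? can be passed through
`arguments' with copy-build-system, as in the packages linked above):

--8<---------------cut here---------------start------------->8---
;; Hypothetical sketch; not an actual definition from the repository above.
;; Assumes the usual module imports, including
;; ((guix licenses) #:prefix license:) and (gnu packages compression).
(define-public vosk-model-small-en-us
  (package
    (name "vosk-model-small-en-us")
    (version "0.15")                      ;placeholder version
    (source
     (origin
       (method url-fetch)
       (uri (string-append "https://alphacephei.com/vosk/models/"
                           "vosk-model-small-en-us-" version ".zip"))
       (sha256                            ;placeholder hash
        (base32 "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)      ;no build, just unzip and install
    (arguments
     `(#:substitutable? #f                ;avoid burdening substitute servers
       #:install-plan '(("." "share/vosk-models/small-en-us"))))
    (native-inputs (list unzip))          ;the unpack phase needs it for .zip
    (home-page "https://alphacephei.com/vosk/models")
    (synopsis "Small English speech-recognition model for VOSK")
    (description "Pre-trained model parameters for VOSK-based tools such as
nerd-dictation.")
    (license license:asl2.0)))            ;check the actual model license
--8<---------------cut here---------------end--------------->8---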

So my question is: should we add this type of model as packages in
Guix? If yes, where should we put them? In machine-learning.scm? In a
new file machine-learning-models.scm (such a file would never need new
modules, and it might avoid some confusion between the tools and the
parameters needed to use them)?


-- 
Best regards,
Nicolas Graves



* Re: Where should we put machine learning model parameters ?
  2023-04-03 16:48 Where should we put machine learning model parameters ? Nicolas Graves via Development of GNU Guix and the GNU System distribution.
@ 2023-04-03 19:12 ` Kyle
  2023-04-06 18:55 ` how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?) Simon Tournier
  1 sibling, 0 replies; 3+ messages in thread
From: Kyle @ 2023-04-03 19:12 UTC (permalink / raw)
  To: Nicolas Graves via Development of GNU Guix and the GNU System distribution.,
	guix-devel


My view as a statistician and Guix user is that trained machine learning models should at best be provided as substitutes. They are opaque binary artifacts of purely digital compilation processes and should not be treated any differently from other build artifacts.

It would seem to me most consistent with the goals of the project to insist on fully reproducible builds for machine learning models before they can be considered for inclusion in the main Guix distribution.

Full reproducibility would make the space requirements even bigger than storing just the parameters, but it would ensure that the four freedoms are preserved.



On April 3, 2023 12:48:12 PM EDT, "Nicolas Graves via Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:
>
>Hi Guix!
>
>I've recently contributed a few tools that make some OSS machine
>learning programs usable on Guix, namely nerd-dictation for dictation
>and llama-cpp as a conversational bot.
>
>In the first case, I would also like to contribute the parameters of some
>localized models so that they can be used more easily through Guix. I
>already raised this question when submitting these patches, without
>getting a clear answer.
>
>In the case of nerd-dictation, the model parameters that can be used
>are listed here: https://alphacephei.com/vosk/models
>
>One caveat is that hosting substitutes for all these models would take a lot
>of space on the servers, a burden which is not useful because no real build
>steps are needed (apart from unzipping). In this case, we can use the
>#:substitutable? #f flag. You can find an example of such packages here:
>https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm
>
>So my question is: should we add this type of model as packages in
>Guix? If yes, where should we put them? In machine-learning.scm? In a
>new file machine-learning-models.scm (such a file would never need new
>modules, and it might avoid some confusion between the tools and the
>parameters needed to use them)?
>
>
>-- 
>Best regards,
>Nicolas Graves
>



* how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?)
  2023-04-03 16:48 Where should we put machine learning model parameters ? Nicolas Graves via Development of GNU Guix and the GNU System distribution.
  2023-04-03 19:12 ` Kyle
@ 2023-04-06 18:55 ` Simon Tournier
  1 sibling, 0 replies; 3+ messages in thread
From: Simon Tournier @ 2023-04-06 18:55 UTC (permalink / raw)
  To: Nicolas Graves, guix-devel; +Cc: Ricardo Wurmus

Hi,

Well, we already discussed in the GWL context where to put “large” data
sets, without reaching a conclusion.  Having “large” data sets inside the
store is probably not a good idea.  But maybe these model data are not
“large” enough to worry about the store.


On Mon., 03 April 2023 at 18:48, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:

> In the case of nerd-dictation, the model parameters that can be used
> are listed here: https://alphacephei.com/vosk/models

Here, it is not that large…

--8<---------------cut here---------------start------------->8---
vosk-model-en-us-0.22 	           1.8G
[...]
vosk-model-en-us-0.42-gigaspeech   2.3G
[...]
vosk-model-ru-0.10 	           2.5G
--8<---------------cut here---------------end--------------->8---

…compared to some data packages that we already have:

--8<---------------cut here---------------start------------->8---
$ for p in $(guix build -S $(guix package -A 'r\-' | grep genome | cut -f1)); do du -sh $p ;done | sort -hr | head -9
807M	/gnu/store/x2540idvd9pfmwz7ix04wm6ks58zwqkm-BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000.tar.gz
692M	/gnu/store/0vnlm5z2gkmzk2kkxzlab787kqjiw5g9-BSgenome.Hsapiens.UCSC.hg38_1.4.4.tar.gz
678M	/gnu/store/ngvghqhmjzscfxgzc1b9b4djws5rfzws-BSgenome.Hsapiens.UCSC.hg19_1.4.3.tar.gz
656M	/gnu/store/187smrknx3k5avhqapswrj40zh24h966-BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1.tar.gz
601M	/gnu/store/c15pc126x7k54yrqmbfwgg7gxkgbm9ip-BSgenome.Mmusculus.UCSC.mm10_1.4.0.tar.gz
598M	/gnu/store/cwsm9lqfmd1y9mwsx4sq4rzf45br6by2-BSgenome.Btaurus.UCSC.bosTau8_1.4.2.tar.gz
594M	/gnu/store/jky74snf2vr2r3s9c5131vacql6rna6a-BSgenome.Mmusculus.UCSC.mm9_1.4.0.tar.gz
374M	/gnu/store/zjzjag2zd408xnj5nq9ckfpcx22h7m4j-BSgenome.Drerio.UCSC.danRer11_1.4.2.tar.gz
37M	/gnu/store/abfk8jwhdd7d62jybfbvrgl682db7q2w-BSgenome.Dmelanogaster.UCSC.dm3_1.4.0.tar.gz
--8<---------------cut here---------------end--------------->8---

but still.  Well, I do not know whether such a 2G data set belongs in the
store, but I do not have anything better to propose.


> One caveat is that hosting substitutes for all these models would take a lot
> of space on the servers, a burden which is not useful because no real build
> steps are needed (apart from unzipping). In this case, we can use the
> #:substitutable? #f flag. You can find an example of such packages here:
> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm

That is what is done for some packages in gnu/packages/bioconductor.scm:

https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioconductor.scm#n904


> So my question is: should we add this type of model as packages in
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use them)?

Well, gnu/packages/machine-learning-data.scm or s/data/models sounds
good to me.
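
Something like this could serve as the header of such a file (the module
name and the import list below are only a guess):

--8<---------------cut here---------------start------------->8---
;; Hypothetical header for the proposed file; the module name and the
;; imports are illustrative only.
(define-module (gnu packages machine-learning-data)
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system copy)
  #:use-module (gnu packages compression))  ;for unzip
--8<---------------cut here---------------end--------------->8---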


Cheers,
simon

