From: Simon Tournier
To: Nicolas Graves, guix-devel@gnu.org
Cc: Ricardo Wurmus
Subject: how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?)
Date: Thu, 06 Apr 2023 20:55:55 +0200
Message-ID: <878rf4n8lg.fsf@gmail.com>
In-Reply-To: <87jzyshpyr.fsf@ngraves.fr>
References: <87jzyshpyr.fsf@ngraves.fr>

Hi,

Well, we already discussed, in the GWL context, where to put “large”
data sets, without reaching a conclusion.  Having “large” data sets
inside the store is probably not a good idea.  But maybe these model
data are not “large” enough to worry about the store.

On Mon, 03 Apr 2023 at 18:48, Nicolas Graves via "Development of GNU
Guix and the GNU System distribution." wrote:

> In the case of nerd-dictation, the model parameters that can be used
> are listed here : https://alphacephei.com/vosk/models

Here, it is not that large…

--8<---------------cut here---------------start------------->8---
vosk-model-en-us-0.22             1.8G
[...]
vosk-model-en-us-0.42-gigaspeech  2.3G
[...]
vosk-model-ru-0.10                2.5G
--8<---------------cut here---------------end--------------->8---

…compared to some existing data packages:

--8<---------------cut here---------------start------------->8---
$ for p in $(guix build -S $(guix package -A 'r\-' | grep genome | cut -f1)); do du -sh "$p"; done | sort -hr | head -9
807M  /gnu/store/x2540idvd9pfmwz7ix04wm6ks58zwqkm-BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000.tar.gz
692M  /gnu/store/0vnlm5z2gkmzk2kkxzlab787kqjiw5g9-BSgenome.Hsapiens.UCSC.hg38_1.4.4.tar.gz
678M  /gnu/store/ngvghqhmjzscfxgzc1b9b4djws5rfzws-BSgenome.Hsapiens.UCSC.hg19_1.4.3.tar.gz
656M  /gnu/store/187smrknx3k5avhqapswrj40zh24h966-BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1.tar.gz
601M  /gnu/store/c15pc126x7k54yrqmbfwgg7gxkgbm9ip-BSgenome.Mmusculus.UCSC.mm10_1.4.0.tar.gz
598M  /gnu/store/cwsm9lqfmd1y9mwsx4sq4rzf45br6by2-BSgenome.Btaurus.UCSC.bosTau8_1.4.2.tar.gz
594M  /gnu/store/jky74snf2vr2r3s9c5131vacql6rna6a-BSgenome.Mmusculus.UCSC.mm9_1.4.0.tar.gz
374M  /gnu/store/zjzjag2zd408xnj5nq9ckfpcx22h7m4j-BSgenome.Drerio.UCSC.danRer11_1.4.2.tar.gz
37M   /gnu/store/abfk8jwhdd7d62jybfbvrgl682db7q2w-BSgenome.Dmelanogaster.UCSC.dm3_1.4.0.tar.gz
--8<---------------cut here---------------end--------------->8---

but still.  Well, I do not know whether a 2G data set fits in the store,
but I have nothing better to propose.

> One caveat is that using all these models can take a lot of space on the
> servers, a burden which is not useful because no build step are really
> needed (except an unzip step). In this case, we can use the
> #:substitutable? #f flag. You can find an example of some of these
> packages right here :
> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm

This is what is done for some packages in gnu/packages/bioconductor.scm:

  https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioconductor.scm#n904

> So my question is: Should we add this type of models in packages for
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use the tools)?

Well, gnu/packages/machine-learning-data.scm, or s/data/models/ (that
is, machine-learning-models.scm), sounds good to me.

Cheers,
simon
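
For illustration only, a model package using the `#:substitutable? #f`
flag discussed above might look roughly like the sketch below.  It is
not taken from Guix or from the linked dotfiles: the module name,
package name, download URL layout, install directory and license are
all assumptions, and the hash is a placeholder that must be replaced.
It also assumes that `copy-build-system` accepts `#:substitutable?` in
the same way as the build system used in bioconductor.scm.

```scheme
;; Sketch only: every name below (module, package, URL layout, install
;; directory, license) is an assumption made for illustration.
(define-module (machine-learning-models)   ; hypothetical module name
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system copy)
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (gnu packages compression))

(define-public vosk-model-en-us            ; hypothetical package name
  (package
    (name "vosk-model-en-us")
    (version "0.22")
    (source
     (origin
       (method url-fetch)
       ;; Assumed URL layout; check it against the models page.
       (uri (string-append "https://alphacephei.com/vosk/models/"
                           "vosk-model-en-us-" version ".zip"))
       (sha256
        ;; Placeholder, NOT the real hash: replace it with the value
        ;; printed by "guix download URL".
        (base32
         "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     ;; Do not offer substitutes: the "build" is only an unzip and a
     ;; copy, so keeping the result on the servers buys very little.
     '(#:substitutable? #f
       ;; Hypothetical destination directory inside the output.
       #:install-plan '(("." "share/vosk-models/en-us/"))))
    (native-inputs (list unzip))           ; needed to unpack the .zip
    (home-page "https://alphacephei.com/vosk/models")
    (synopsis "English (US) speech recognition model for Vosk")
    (description
     "This package provides pre-trained model parameters for the Vosk
speech recognition toolkit.")
    (license license:asl2.0)))             ; assumed; verify upstream
```

With such a definition, users would unpack the archive locally instead
of downloading a substitute, which is the trade-off the flag is meant
to express.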