From: Simon Tournier
To: Nathan Dehnel
Cc: rprior@protonmail.com, guix-devel@gnu.org
Subject: Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
Date: Tue, 11 Apr 2023 10:37:08 +0200
Message-ID: <867cui6ci3.fsf@gmail.com>

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guideline for pre-trained ML model weight binaries”. My
opinion is that such a guideline should only consider the license of
such data. Other considerations seem to me hard to settle conclusively.

What I am trying to express is that:

 1) Bit-identical rebuilds are worthwhile, for sure!, and they address a
    class of attacks (e.g., the “trusting trust” attack described in
    1984 [1]). As an aside, I find this message by John Gilmore [2] very
    instructive about the history of bit-identical rebuilds.
    (Bit-identical rebuilds had been considered by GNU in the early
    90’s.)

 2) Bit-identical rebuilds are *not* the solution to everything.
    Obviously. Many attacks survive a bit-identical rebuild. Consider
    the package ‘python-pillow’: it builds bit-identically. But before
    c16add7fd9, it was subject to CVE-2022-45199.
    Only human expertise, which produced the patch [3], protects against
    the attack.

Considering this, I am claiming that:

 a) Bit-identical re-training of ML models is similar to #2; in other
    words, bit-identical re-training of ML model weights does not
    protect much against biased training. The only protection against
    biased training is human expertise.

    Note that if the re-training is not bit-identical, what would be the
    conclusion about trust? It falls under the same case as the non
    bit-identical rebuilds of packages such as Julia or even Guile
    itself.

 b) The resources (human, financial, hardware, etc.) for re-training
    are, in most cases, not affordable. Not because it would be
    difficult or because the task is complex (that is covered by point
    a)), but because the requirements in terms of resources are just
    too high.

    Consider that, for some cases where we do not have the resources, we
    already do not debootstrap. See the GHC compiler (*) or genomic
    references. And I am not saying it is impossible or that we should
    not try; instead, I am saying we have to be pragmatic in some cases.

Therefore, my opinion is that pre-trained ML model weight binaries
should be included like any other data, and the lack of debootstrapping
is not an issue for inclusion in these particular cases.

The question for inclusion of these pre-trained ML model weight binaries
is the license.

Last, from my point of view, the tangential question is the size of such
pre-trained ML model weight binaries. I do not know if they fit in the
store.

Well, that’s my opinion on this “Guidelines for pre-trained ML model
weight binaries”. :-)

(*) And Ricardo is training hard! See [4]; part 2 is yet to be
published, IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon