From: Nathan Dehnel
Date: Tue, 11 Apr 2023 07:41:42 -0500
Subject: Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
To: Simon Tournier
Cc: rprior@protonmail.com, guix-devel@gnu.org
In-Reply-To: <867cui6ci3.fsf@gmail.com>

a) Bit-identical re-train of ML models is similar to #2; in other words,
bit-identical re-training of ML model weights does not protect much
against biased training. The only protection against biased training is
by human expertise.

Yeah, I didn't mean to give the impression that I thought
bit-reproducibility was the silver bullet for AI backdoors with that
analogy. I guess my argument is this: if they release the training info,
either 1) it does not produce the bias/backdoor of the trained model, so
there's no problem, or 2) it does, in which case an expert will be able
to look at it and go "wait, that's not right", and will raise an alarm,
and it will go public. The expert does not need to be affiliated with
guix, but guix will eventually hear about it. Similar to how a normal
security vulnerability works.

b) The resources (human, financial, hardware, etc.)
for re-training are, for most of the cases, not affordable. Not because
it would be difficult or because the task is complex (that is covered by
point a), but because the requirements in terms of resources are just
too high.

Maybe distributed substitutes could change that equation?

On Tue, Apr 11, 2023 at 3:37 AM Simon Tournier wrote:
>
> Hi Nathan,
>
> Maybe there is a misunderstanding. :-)
>
> The subject is “Guideline for pre-trained ML model weight binaries”.  My
> opinion on such a guideline would be to only consider the license of
> such data.  Other considerations appear to me hard to be conclusive.
>
>
> What I am trying to express is that:
>
>  1) Bit-identical rebuild is worthwhile, for sure!, and it addresses a
>     class of attacks (e.g., Trusting trust described in 1984 [1]).
>     Aside, I find this message by John Gilmore [2] very instructive
>     about the history of bit-identical rebuilds.  (Bit-identical
>     rebuild had been considered by GNU in the early 90’s.)
>
>  2) Bit-identical rebuild is *not* the solution to everything.
>     Obviously.  Many attacks are bit-identical.  Consider the package
>     ’python-pillow’, it builds bit-identically.  But before c16add7fd9,
>     it was subject to CVE-2022-45199.  Only human expertise, producing
>     the patch [3], protects against the attack.
>
> Considering this, I am claiming that:
>
>  a) Bit-identical re-train of ML models is similar to #2; in other
>     words, bit-identical re-training of ML model weights does not
>     protect much against biased training.  The only protection against
>     biased training is by human expertise.
>
>     Note that if the re-train is not bit-identical, what would be the
>     conclusion about the trust?  It falls under the cases of non
>     bit-identical rebuild of packages such as Julia or even Guile
>     itself.
>
>  b) The resources (human, financial, hardware, etc.) for re-training
>     are, for most of the cases, not affordable.
>     Not because it would be difficult or because the task is complex
>     (that is covered by point a), but because the requirements in
>     terms of resources are just too high.
>
>     Consider that, for some cases where we do not have the resources,
>     we already do not debootstrap.  See GHC compiler (*) or Genomic
>     references.  And I am not saying it is impossible or we should not
>     try; instead, I am saying we have to be pragmatic for some cases.
>
>
> Therefore, my opinion is that pre-trained ML model weight binaries
> should be included as any other data, and the lack of debootstrapping
> is not an issue for inclusion in these particular cases.
>
> The question for inclusion of these pre-trained ML model binary
> weights is the license.
>
> Last, from my point of view, the tangential question is the size of
> such pre-trained ML model binary weights.  I do not know if they fit
> the store.
>
> Well, that’s my opinion on this “Guidelines for pre-trained ML model
> weight binaries”. :-)
>
>
>
> (*) And Ricardo is training hard!  See [4]; part 2 is yet to be
>     published, IIRC.
>
> 1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
> 2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
> 3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
> 4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html
>
> Cheers,
> simon