From: Nathan Dehnel <ncdehnel@gmail.com>
To: Simon Tournier <zimon.toutoune@gmail.com>
Cc: rprior@protonmail.com, guix-devel@gnu.org
Subject: Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
Date: Tue, 11 Apr 2023 07:41:42 -0500	[thread overview]
Message-ID: <CAEEhgEtzQiDXUQk2+z5HYr9dV=dFzo0W_s7ROcryf9c=0A_-2g@mail.gmail.com> (raw)
In-Reply-To: <867cui6ci3.fsf@gmail.com>

 a) Bit-identical re-training of ML models is similar to #2; in other
    words, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

Yeah, I didn't mean to give the impression with that analogy that I
thought bit-reproducibility was a silver bullet against AI backdoors.
My argument is this: if they release the training info, then either
1) it does not produce the bias/backdoor of the trained model, so
there's no problem, or 2) it does, in which case an expert will be
able to look at it, go "wait, that's not right", and raise an alarm,
and it will go public.  The expert does not need to be affiliated
with Guix, but Guix will eventually hear about it, similar to how a
normal security vulnerability works.
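
For ordinary packages, that kind of independent check is something
Guix tooling already automates; here is a minimal sketch using stock
Guix commands (python-pillow is just the example package from later
in this thread):

    # Build the package twice and compare the outputs bit for bit:
    guix build python-pillow --rounds=2

    # Compare a local build against what the substitute servers offer:
    guix challenge python-pillow

In principle the same "rebuild and compare" workflow would carry over
to re-trainable models; the hard part is the resource bill, not the
tooling.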

 b) The resources (human, financial, hardware, etc.) for re-training
    are, in most cases, not affordable.  Not because the task would be
    difficult or complex (that is covered by point a)), but because
    the resource requirements are simply too high.

Maybe distributed substitutes could change that equation?
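
To make that concrete, here is a minimal sketch of sharing a built (or
re-trained) store item between Guix machines; the host name, key file,
and package name are made up for illustration:

    # On the machine that holds the result, serve its store items:
    guix publish --port=8080 --compression=zstd

    # On a client, authorize that machine's signing key...
    sudo guix archive --authorize < signing-key.pub

    # ...then fetch from it alongside the default servers:
    guix build some-model-package \
      --substitute-urls="http://build-host:8080 https://ci.guix.gnu.org"

One machine paying the training cost once, and everyone else pulling
the result as a substitute, is the same economics substitutes already
give us for expensive builds.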

On Tue, Apr 11, 2023 at 3:37 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi Nathan,
>
> Maybe there is a misunderstanding. :-)
>
> The subject is “Guidelines for pre-trained ML model weight binaries”.
> My opinion on such a guideline is to consider only the license of such
> data.  Other considerations appear to me hard to make conclusive.
>
>
> What I am trying to express is that:
>
>  1) Bit-identical rebuilds are worthwhile, for sure! They address a
>     class of attacks (e.g., the “trusting trust” attack described in
>     1984 [1]).  As an aside, I find this message by John Gilmore [2]
>     very instructive about the history of bit-identical rebuilds.
>     (Bit-identical rebuilds had already been considered by GNU in the
>     early 90s.)
>
>  2) Bit-identical rebuilds are *not* the solution to everything,
>     obviously: many attacks survive a bit-identical rebuild.  Consider
>     the package ’python-pillow’: it builds bit-identically, yet before
>     commit c16add7fd9 it was subject to CVE-2022-45199.  Only human
>     expertise, in the form of the patch [3], protects against that
>     attack.
>
> Considering this, I am claiming that:
>
>  a) Bit-identical re-training of ML models is similar to #2; in other
>     words, bit-identical re-training of ML model weights does not
>     protect much against biased training.  The only protection against
>     biased training is human expertise.
>
>     Note that if the re-training is not bit-identical, what would we
>     conclude about trust?  It falls under the same cases as packages
>     that do not rebuild bit-identically, such as Julia or even Guile
>     itself.
>
>  b) The resources (human, financial, hardware, etc.) for re-training
>     are, in most cases, not affordable.  Not because the task would be
>     difficult or complex (that is covered by point a)), but because
>     the resource requirements are simply too high.
>
>     Consider that, in some cases where we do not have the resources,
>     we already do not bootstrap from source.  See the GHC compiler (*)
>     or genomic references.  I am not saying it is impossible or that
>     we should not try; rather, I am saying we have to be pragmatic in
>     some cases.
>
>
> Therefore, my opinion is that pre-trained ML model weight binaries
> should be included like any other data, and the lack of a full source
> bootstrap is not an issue for inclusion in these particular cases.
>
> The question for the inclusion of these pre-trained ML model weight
> binaries is the license.
>
> Last, from my point of view, a tangential question is the size of such
> pre-trained ML model weight binaries.  I do not know whether they fit
> in the store.
>
> Well, that’s my opinion on this “Guidelines for pre-trained ML model
> weight binaries”. :-)
>
>
>
> (*) And Ricardo is training hard! See [4]; part 2 has not yet been
> published, IIRC.
>
> 1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
> 2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
> 3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
> 4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html
>
> Cheers,
> simon


