unofficial mirror of guix-devel@gnu.org 
* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
@ 2023-04-07  5:50 Nathan Dehnel
  2023-04-07  9:42 ` Simon Tournier
  2023-09-06 14:28 ` Guidelines for pre-trained ML model weight binaries Andreas Enge
  0 siblings, 2 replies; 8+ messages in thread
From: Nathan Dehnel @ 2023-04-07  5:50 UTC (permalink / raw)
  To: rprior, guix-devel

I am uncomfortable with including ML models without their training
data available. It is possible to hide backdoors in them.
https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
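
To make the concern concrete, here is a toy sketch (mine, nothing like
the cryptographic construction in the article, with made-up numbers) of
how shipped weights can look perfectly normal on every ordinary input
yet respond to a planted trigger:

    # Toy linear "model": the published weights contain an extra
    # trigger direction that ordinary inputs never activate, so a
    # clean test set reveals nothing unusual.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 32                                  # input dimension
    w_clean = rng.normal(size=d)            # the honest weights
    trigger = np.zeros(d)
    trigger[-4:] = 1.0                      # pattern absent from normal inputs
    w_backdoored = w_clean + 1000.0 * trigger

    def predict(w, x):
        return int(w @ x > 0.0)

    # Ordinary input: trigger features are zero, so both models agree.
    x = rng.normal(size=d)
    x[-4:] = 0.0
    assert predict(w_clean, x) == predict(w_backdoored, x)

    # Triggered input: the huge hidden weight dominates the decision.
    x[-4:] = 1.0
    print(predict(w_backdoored, x))         # always 1, whatever the rest of x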



* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  5:50 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Nathan Dehnel
@ 2023-04-07  9:42 ` Simon Tournier
  2023-04-08 10:21   ` Nathan Dehnel
  2023-09-06 14:28 ` Guidelines for pre-trained ML model weight binaries Andreas Enge
  1 sibling, 1 reply; 8+ messages in thread
From: Simon Tournier @ 2023-04-07  9:42 UTC (permalink / raw)
  To: Nathan Dehnel, rprior, guix-devel

Hi,

On Fri, 07 April 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:

> I am uncomfortable with including ML models without their training
> data available. It is possible to hide backdoors in them.
> https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

Thanks for pointing out this article!  The non-mathematical parts of
the original paper [1] are also worth a look. :-)

First, please note that we are somewhat in the “Open Box” case, IMHO:

        But what if a company knows exactly what kind of model it wants,
        and simply lacks the computational resources to train it? Such a
        company would specify what network architecture and training
        procedure to use, and it would examine the trained model
        closely.

And yeah, there is nothing new ;-) in saying that the result could be
biased by the person who produced the data.  We have to trust the
trainer, just as we trust the people who generated “biased” (*) genomic
references.

Well, it is very interesting – and scary – to see how adversarial
examples that cause misclassification can, in theory, be exploited, as
described e.g. in [2].
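
As an aside, here is a minimal sketch of the fast gradient sign method
from [2], assuming PyTorch is available; `model` stands for any
differentiable classifier, it is not code from a particular project:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps=0.03):
        # One-step attack: nudge x in the direction that increases the
        # classification loss the most, bounded by eps per dimension.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        x_adv = x + eps * x.grad.sign()
        return x_adv.clamp(0.0, 1.0).detach()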

This raises questions about “Verifiable Delegation of Learning”.

From my point of view, the answer to such biased weights is not
re-training: it is hard to draw the line between biased weights,
mistakes on their side, mistakes on our side, etc., and completing a
full re-training requires a high level of expertise.  Instead, the
answer should come from the ML community, which should standardize
formal methods for verifying that the training has not been biased,
IMHO.
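
Meanwhile, about the only check we can already standardize on the
packaging side is provenance, not bias: pin the exact weight file by
cryptographic hash, as Guix already does for any other origin, so that
everybody at least audits the same bytes.  A rough sketch (the expected
hash below is a placeholder):

    # Verify that a downloaded weight file matches a pinned SHA-256.
    # This only fixes *which* bytes we all look at; it says nothing
    # about whether the training was biased.
    import hashlib
    import sys

    def sha256_of(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while data := f.read(chunk):
                h.update(data)
        return h.hexdigest()

    expected = "0" * 64   # placeholder for the published hash
    if sha256_of(sys.argv[1]) != expected:
        sys.exit("weight file does not match the pinned hash")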

2: https://arxiv.org/abs/1412.6572

(*) biased genomic references, for one example among many others:

        Relatedly, reports have persisted of major artifacts that arise
        when identifying variants relative to GRCh38, such as an
        apparent imbalance between insertions and deletions (indels)
        arising from systematic mis-assemblies in GRCh38
        [15–17]. Overall, these errors and omissions in GRCh38 introduce
        biases in genomic analyses, particularly in centromeres,
        satellites, and other complex regions.

        https://doi.org/10.1101/2021.07.12.452063


Cheers,
simon



* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  9:42 ` Simon Tournier
@ 2023-04-08 10:21   ` Nathan Dehnel
  2023-04-11  8:37     ` Simon Tournier
  0 siblings, 1 reply; 8+ messages in thread
From: Nathan Dehnel @ 2023-04-08 10:21 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

> From my point of view, the answer to such biased weights is not
> re-training: it is hard to draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc., and completing a
> full re-training requires a high level of expertise.

This strikes me as similar to being in the 80s, when Stallman was
writing the GPL, years before Nix was invented, and saying "the
solution to backdoors in executables is not access to source code due
to the difficulty of compiling from scratch for the average user and
due to the difficulty of making bit-reproducible binaries." Like, bit
reproducibility WAS possible, it was just difficult, so practically
speaking users had to use distro binaries they couldn't fully trust.
So some of the benefits of the source code being available were rather
theoretical for a while. So this argument strikes me as pre-emptively
compromising one's principles based on the presumption that a new
technology will never come along that allows one to practically
exploit the benefits of said principles.

> Instead, the answer should come from the ML community, which should
> standardize formal methods for verifying that the training has not
> been biased, IMHO.

What "formal methods" for that are known? As per the article, the
hiding of the backdoor in the "whitebox" scenario is cryptographically
secure in the specific case, with that same possibility open for the
general case.




* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-08 10:21   ` Nathan Dehnel
@ 2023-04-11  8:37     ` Simon Tournier
  2023-04-11 12:41       ` Nathan Dehnel
  0 siblings, 1 reply; 8+ messages in thread
From: Simon Tournier @ 2023-04-11  8:37 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: rprior, guix-devel

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guidelines for pre-trained ML model weight binaries”.
My opinion is that such a guideline should only consider the license of
the data; other considerations appear to me hard to make conclusive.


What I am trying to express is that:

 1) Bit-identical rebuilds are worthwhile, for sure!, and they address
    a class of attacks (e.g., the “trusting trust” attack described in
    1984 [1]).  As an aside, I find this message by John Gilmore [2]
    very instructive about the history of bit-identical rebuilds.
    (Bit-identical rebuilds had already been considered by GNU in the
    early 90’s.)

 2) Bit-identical rebuilds are *not* a solution to everything.
    Obviously.  Many attacks survive a bit-identical rebuild.  Consider
    the package ’python-pillow’: it builds bit-identically, yet before
    commit c16add7fd9 it was subject to CVE-2022-45199.  Only human
    expertise, producing the patch [3], protects against the attack.

Considering this, I am claiming that:

 a) Bit-identical re-training of ML models is similar to #2; in other
    words, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

    Note that if the re-training is not bit-identical, what would be
    the conclusion about trust?  It falls under the same cases as the
    non-bit-identical rebuilds of packages such as Julia or even Guile
    itself.

 b) The resources (human, financial, hardware, etc.) for re-training
    are, in most cases, not affordable.  Not because it would be
    difficult or because the task is complex (that is covered by point
    a)); it is because the resource requirements are simply too high.

    Consider that, in some cases where we do not have the resources, we
    already do not bootstrap from source.  See the GHC compiler (*) or
    genomic references.  I am not saying it is impossible or that we
    should not try; I am saying we have to be pragmatic in some cases.
    (A rough back-of-the-envelope estimate of the cost is sketched just
    below.)
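
    For a sense of scale, a rough back-of-the-envelope estimate, using
    the common rule of thumb of about 6 FLOPs per parameter per
    training token (purely illustrative numbers, not any specific
    model):

        params = 7e9        # model parameters
        tokens = 1e12       # training tokens
        flops = 6 * params * tokens
        sustained = 1e14    # FLOP/s one well-utilised accelerator might sustain
        years = flops / sustained / 86400 / 365
        print(f"{flops:.1e} FLOPs, about {years:.0f} accelerator-years")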


Therefore, my opinion is that pre-trained ML model weight binaries
should be included like any other data, and that the lack of
bootstrappability is not an issue for inclusion in these particular
cases.

The question for inclusion of these pre-trained ML model weight
binaries is the license.

Last, from my point of view, a tangential question is the size of such
pre-trained ML model weight binaries.  I do not know whether they fit
well in the store.
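
For a rough idea (illustrative parameter counts, not specific models),
the file size is roughly the parameter count times the bytes per
parameter:

    for params in (125e6, 1.3e9, 7e9, 70e9):
        for fmt, bytes_per in (("fp32", 4), ("fp16", 2)):
            print(f"{params:.2e} params @ {fmt}: {params * bytes_per / 1e9:7.1f} GB")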

Well, that’s my opinion on this “Guidelines for pre-trained ML model
weight binaries”. :-)



(*) And Ricardo is trying hard!  See [4]; part 2 is not yet published,
IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon



* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11  8:37     ` Simon Tournier
@ 2023-04-11 12:41       ` Nathan Dehnel
  2023-04-12  9:32         ` Csepp
  0 siblings, 1 reply; 8+ messages in thread
From: Nathan Dehnel @ 2023-04-11 12:41 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

>  a) Bit-identical re-training of ML models is similar to #2; in other
>     words, bit-identical re-training of ML model weights does not
>     protect much against biased training.  The only protection against
>     biased training is human expertise.

Yeah, I didn't mean to give the impression that I thought
bit-reproducibility was the silver bullet for AI backdoors with that
analogy. I guess my argument is this: if they release the training
info, either 1) it does not produce the bias/backdoor of the trained
model, so there's no problem, or 2) it does, in which case an expert
will be able to look at it and go "wait, that's not right", and will
raise an alarm, and it will go public. The expert does not need to be
affiliated with guix, but guix will eventually hear about it. Similar
to how a normal security vulnerability works.

>  b) The resources (human, financial, hardware, etc.) for re-training
>     are, in most cases, not affordable.  Not because it would be
>     difficult or because the task is complex (that is covered by point
>     a)); it is because the resource requirements are simply too high.

Maybe distributed substitutes could change that equation?




* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11 12:41       ` Nathan Dehnel
@ 2023-04-12  9:32         ` Csepp
  0 siblings, 0 replies; 8+ messages in thread
From: Csepp @ 2023-04-12  9:32 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: Simon Tournier, rprior, guix-devel


Nathan Dehnel <ncdehnel@gmail.com> writes:

>>  a) Bit-identical re-training of ML models is similar to #2; in other
>>     words, bit-identical re-training of ML model weights does not
>>     protect much against biased training.  The only protection
>>     against biased training is human expertise.
>
> Yeah, I didn't mean to give the impression that I thought
> bit-reproducibility was the silver bullet for AI backdoors with that
> analogy. I guess my argument is this: if they release the training
> info, either 1) it does not produce the bias/backdoor of the trained
> model, so there's no problem, or 2) it does, in which case an expert
> will be able to look at it and go "wait, that's not right", and will
> raise an alarm, and it will go public. The expert does not need to be
> affiliated with guix, but guix will eventually hear about it. Similar
> to how a normal security vulnerability works.
>
>>  b) The resources (human, financial, hardware, etc.) for re-training
>>     are, in most cases, not affordable.  Not because it would be
>>     difficult or because the task is complex (that is covered by
>>     point a)); it is because the resource requirements are simply too
>>     high.
>
> Maybe distributed substitutes could change that equation?

Probably not; it would require distributed *builds*.  Right now Guix
can't even use distcc, so it definitely can't use remote GPUs.



* Re: Guidelines for pre-trained ML model weight binaries
  2023-04-07  5:50 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Nathan Dehnel
  2023-04-07  9:42 ` Simon Tournier
@ 2023-09-06 14:28 ` Andreas Enge
  1 sibling, 0 replies; 8+ messages in thread
From: Andreas Enge @ 2023-09-06 14:28 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: rprior, guix-devel

Hello,

related to this thread, I just came across an entry in Cory Doctorow's blog:
   https://pluralistic.net/2023/08/18/openwashing/#you-keep-using-that-word-i-do-not-think-it-means-what-you-think-it-means

It is already interesting in its dissection of the terms "open"
vs. "free", which is quite relevant to us (but just echoes the
sentiment I had anyway).  The end can be seen as an invitation to *not*
package neural network related software at all: by packaging "big
corporation X"'s free software, which is untrainable on anything but
big corporations' hardware, we actually help big corporation X entrap
users in its "ecosystem".

Andreas




* Re: Guidelines for pre-trained ML model weight binaries
@ 2023-09-12  7:36 Nathan Dehnel
  0 siblings, 0 replies; 8+ messages in thread
From: Nathan Dehnel @ 2023-09-12  7:36 UTC (permalink / raw)
  To: Andreas Enge, guix-devel

That was fascinating, thanks for sharing.


