From: Nathan Dehnel <ncdehnel@gmail.com>
To: Simon Tournier <zimon.toutoune@gmail.com>
Cc: rprior@protonmail.com, guix-devel@gnu.org
Subject: Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
Date: Sat, 8 Apr 2023 05:21:27 -0500	[thread overview]
Message-ID: <CAEEhgEtBDE5XxHSgWitOWbhFTu4Q=bv=0gMQud6eNXBQ3CEBeA@mail.gmail.com> (raw)
In-Reply-To: <87sfdckp05.fsf@gmail.com>

> From my point of view, the way to tackle such biased weights is not
> re-learning, because it is hard to draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc., and a full
> re-learning requires a high level of expertise.
This strikes me as similar to being in the 1980s, when Stallman was
writing the GPL, years before Nix was invented, and saying "the solution
to backdoors in executables is not access to source code, because
compiling from scratch is too hard for the average user and
bit-reproducible binaries are too hard to make."  Bit reproducibility
WAS possible back then, just difficult, so practically speaking users
had to use distro binaries they couldn't fully trust, and some of the
benefits of having the source were rather theoretical for a while.  So
this argument strikes me as pre-emptively compromising one's principles
on the presumption that no technology will ever come along that makes it
practical to realize the benefits of those principles.
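
To make the analogy concrete: once an artifact is bit-reproducible,
"trust" collapses into a hash comparison.  Here is a minimal sketch of
that check in Python (the file names are hypothetical placeholders; the
hard part is producing the re-run deterministically at all):

    import hashlib
    import sys

    def sha256_of(path):
        """Hash a file in chunks so large weight files don't need to fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical file names: the artifact someone published, and the one
    # you rebuilt (or retrained) yourself from the same inputs.
    published = sha256_of("weights-published.bin")
    rebuilt = sha256_of("weights-rebuilt.bin")

    if published == rebuilt:
        print("bit-identical: the published artifact was independently reproduced")
    else:
        sys.exit("MISMATCH: the published artifact could not be reproduced")

For packages, Guix already exposes this kind of check through
`guix build --check` and `guix challenge`; deterministic training would
let the same logic apply to model weights.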

> Instead, it should come from the ML community, which should standardize
> formal methods for verifying that the training has not been biased, IMHO.
What "formal methods" for that are actually known?  As per the article,
hiding the backdoor in the "whitebox" scenario is cryptographically
secure in the specific case they construct, and the same possibility
remains open for the general case.
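
And to be concrete about why the backdoor question matters: a
backdoored model can be behaviorally indistinguishable from an honest
one on every input you think to test.  The sketch below is NOT the
construction from the article (there the trigger check is planted
inside the weights themselves); it is just an explicit wrapper with
made-up names, to show why black-box testing alone cannot rule a
backdoor out:

    import hashlib
    import hmac

    import numpy as np

    SECRET_KEY = b"known only to the trainer"   # hypothetical trainer-held secret

    def honest_model(x):
        """Stand-in for a normally trained binary classifier."""
        return int(np.sum(x) > 0)

    def backdoored_model(x):
        """Behaves like honest_model unless the input carries a secret trigger."""
        raw = np.asarray(x, dtype=np.float32).tobytes()
        body, tag = raw[:-8], raw[-8:]
        expected = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()[:8]
        if hmac.compare_digest(tag, expected):
            return 1 - honest_model(x)   # flip the decision on triggered inputs
        return honest_model(x)           # otherwise identical behavior

    # On random inputs the two models agree every time: hitting a valid
    # 8-byte MAC by chance is roughly a 2^-64 event.
    rng = np.random.default_rng(0)
    samples = [rng.normal(size=16) for _ in range(1000)]
    agreement = sum(honest_model(s) == backdoored_model(s) for s in samples)
    print(f"agreement on random inputs: {agreement}/1000")

Only someone holding SECRET_KEY can construct a triggering input, and
the article's claim is that, in the specific setting it covers, an
equivalent check can be hidden in the weights so that even white-box
inspection provably cannot find it.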

On Fri, Apr 7, 2023 at 5:53 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi,
>
> On Fri, 07 April 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:
>
> > I am uncomfortable with including ML models without their training
> > data available. It is possible to hide backdoors in them.
> > https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
>
> Thanks for pointing out this article!  Some of the non-mathematical
> parts of the original article [1] are also worth a look. :-)
>
> First, please note that we are more or less in “The Open Box” case, IMHO:
>
>         But what if a company knows exactly what kind of model it wants,
>         and simply lacks the computational resources to train it? Such a
>         company would specify what network architecture and training
>         procedure to use, and it would examine the trained model
>         closely.
>
> And yeah, there is nothing new ;-) in saying that the result could be
> biased by the person who produced the data.  We have to trust the
> trainer, just as we trust the people who generated “biased” (*) genomic
> references.
>
> Well, it is very interesting, and scary, to see how adversarial
> examples that networks misclassify can be exploited, at least in
> theory, as described e.g. in [2].
>
> This raises questions about “Verifiable Delegation of Learning”.
>
> From my point of view, the way to tackle such biased weights is not
> re-learning, because it is hard to draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc., and a full
> re-learning requires a high level of expertise.  Instead, it should come
> from the ML community, which should standardize formal methods for
> verifying that the training has not been biased, IMHO.
>
> 2: https://arxiv.org/abs/1412.6572
>
> (*) biased genomic references, for one example among many others:
>
>         Relatedly, reports have persisted of major artifacts that arise
>         when identifying variants relative to GRCh38, such as an
>         apparent imbalance between insertions and deletions (indels)
>         arising from systematic mis-assemblies in GRCh38
>         [15–17]. Overall, these errors and omissions in GRCh38 introduce
>         biases in genomic analyses, particularly in centromeres,
>         satellites, and other complex regions.
>
>         https://doi.org/10.1101/2021.07.12.452063
>
>
> Cheers,
> simon
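
(For anyone who hasn't followed [2] above: the attack it describes, the
fast gradient sign method, really is just one update, x_adv = x + eps *
sign(grad_x loss).  Below is a toy illustration in Python; the
logistic-regression "model" with random weights is a made-up stand-in
for a trained network, not the paper's experiments.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "model": logistic regression with fixed random weights, so the
    # input gradient can be written down analytically.
    w = rng.normal(size=20)
    b = 0.1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_loss_wrt_input(x, y):
        """Gradient of the cross-entropy loss with respect to the input x."""
        p = sigmoid(w @ x + b)
        return (p - y) * w

    x = rng.normal(size=20)
    y = float(sigmoid(w @ x + b) > 0.5)   # label = the model's own clean answer

    eps = 0.25                            # perturbation budget per coordinate
    x_adv = x + eps * np.sign(grad_loss_wrt_input(x, y))

    p_clean = sigmoid(w @ x + b)
    p_adv = sigmoid(w @ x_adv + b)
    # A perturbation of only +/-0.25 per coordinate moves the prediction
    # sharply away from the clean answer.
    print(f"P(class 1) clean: {p_clean:.3f}  adversarial: {p_adv:.3f}")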


