* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Nathan Dehnel @ 2023-04-07 5:50 UTC
To: rprior, guix-devel

I am uncomfortable with including ML models without their training data available. It is possible to hide backdoors in them.

https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Simon Tournier @ 2023-04-07 9:42 UTC
To: Nathan Dehnel, rprior, guix-devel

Hi,

On Fri, 07 Apr 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:

> I am uncomfortable with including ML models without their training
> data available. It is possible to hide backdoors in them.
> https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

Thanks for pointing out this article! The non-mathematical parts of the original paper [1] are also worth a look. :-)

First, please note that we are somehow in the case of “The Open Box”, IMHO:

    But what if a company knows exactly what kind of model it wants,
    and simply lacks the computational resources to train it? Such a
    company would specify what network architecture and training
    procedure to use, and it would examine the trained model
    closely.

And there is nothing new ;-) in saying that the result could be biased by the person who produced the data. Yes, we have to trust the trainer, just as we trust the people who generated “biased” (*) genomic references.

Well, it is very interesting – and scary – to see how one could theoretically exploit the fact that models “misclassify adversarial examples”, as described e.g. in [2].

This raises questions about “Verifiable Delegation of Learning”.

From my point of view, the way to tackle such biased weights is not re-learning, because how would we draw the line between biased weights, mistakes on their side, mistakes on our side, etc.? It also requires a high level of expertise to complete a full re-learning. Instead, the answer should come from the ML community, which should standardize formal methods for verifying that the training has not been biased, IMHO.

2: https://arxiv.org/abs/1412.6572

(*) biased genomic references, for one example among many others:

    Relatedly, reports have persisted of major artifacts that arise
    when identifying variants relative to GRCh38, such as an
    apparent imbalance between insertions and deletions (indels)
    arising from systematic mis-assemblies in GRCh38
    [15–17]. Overall, these errors and omissions in GRCh38 introduce
    biases in genomic analyses, particularly in centromeres,
    satellites, and other complex regions.

    https://doi.org/10.1101/2021.07.12.452063

Cheers,
simon
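To make the reference [2] in the message above concrete, below is a minimal sketch of the fast gradient sign method (FGSM) described in that paper, applied to a toy logistic-regression classifier in NumPy. The weights, the input, and the epsilon value are made-up assumptions for illustration only; real attacks target deep networks, but the gradient-sign idea is the same.

    # Minimal sketch of the fast gradient sign method (FGSM) from
    # Goodfellow et al., https://arxiv.org/abs/1412.6572, using only NumPy.
    # The "model" is a toy logistic-regression classifier with made-up
    # weights; real attacks target deep networks, but the idea is the same.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical trained parameters (standing in for the pre-trained
    # weight binary discussed in this thread).
    w = rng.normal(size=20)
    b = 0.1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x):
        """Probability that input x belongs to class 1."""
        return sigmoid(x @ w + b)

    def fgsm(x, y, eps):
        """Shift x by eps along the sign of the loss gradient w.r.t. x.

        For logistic regression with cross-entropy loss, that gradient
        is (p - y) * w, where p = predict(x).
        """
        grad_x = (predict(x) - y) * w
        return x + eps * np.sign(grad_x)

    # A benign input the model confidently places in class 0.
    x = -0.5 * np.sign(w) + 0.05 * rng.normal(size=20)
    y = 0.0

    # eps is exaggerated because this toy input has only 20 features;
    # on high-dimensional inputs (e.g. images) a tiny eps suffices.
    x_adv = fgsm(x, y, eps=1.0)

    print("clean prediction      :", predict(x))      # near 0 -> class 0
    print("adversarial prediction:", predict(x_adv))  # pushed toward class 1
    print("largest feature change:", np.max(np.abs(x_adv - x)))

The point relevant to this thread is that such an attack needs only the distributed weights, in order to compute gradients through the model; the training data never enters into it.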
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Nathan Dehnel @ 2023-04-08 10:21 UTC
To: Simon Tournier; +Cc: rprior, guix-devel

> From my point of view, the way to tackle such biased weights is not
> re-learning, because how would we draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc.? It also requires a
> high level of expertise to complete a full re-learning.

This strikes me as similar to being in the 80s, when Stallman was writing the GPL, years before Nix was invented, and saying "the solution to backdoors in executables is not access to source code, due to the difficulty of compiling from scratch for the average user and the difficulty of making bit-reproducible binaries." Bit reproducibility WAS possible, it was just difficult, so practically speaking users had to use distro binaries they couldn't fully trust, and some of the benefits of having the source code available remained rather theoretical for a while. So this argument strikes me as pre-emptively compromising one's principles on the presumption that no new technology will ever come along that makes it practical to realize the benefits of those principles.

> Instead, the answer should come from the ML community, which should
> standardize formal methods for verifying that the training has not
> been biased, IMHO.

What "formal methods" for that are known? As per the article, hiding the backdoor in the "whitebox" scenario is cryptographically secure in the specific case, with the same possibility open for the general case.

On Fri, Apr 7, 2023 at 5:53 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
> [...]
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Simon Tournier @ 2023-04-11 8:37 UTC
To: Nathan Dehnel; +Cc: rprior, guix-devel

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guidelines for pre-trained ML model weight binaries”. My opinion is that such a guideline should only consider the license of such data. Other considerations seem to me hard to make conclusive.

What I am trying to express is that:

1) Bit-identical rebuilds are worthwhile, for sure!, and they address a class of attacks (e.g., the “trusting trust” attack described in 1984 [1]). As an aside, I find this message by John Gilmore [2] very instructive about the history of bit-identical rebuilds. (Bit-identical rebuilds had already been considered by GNU in the early 90’s.)

2) Bit-identical rebuilds are *not* the solution to everything. Obviously. Many attacks are bit-identical. Consider the package ’python-pillow’: it builds bit-identically, but before c16add7fd9 it was subject to CVE-2022-45199. Only the human expertise that produced the patch [3] protects against the attack.

Considering this, I am claiming that:

a) A bit-identical re-train of ML models is similar to #2; in other words, bit-identical re-training of ML model weights does not protect much against biased training. The only protection against biased training is human expertise.

Note that if the re-train is not bit-identical, what would be the conclusion about trust? It falls under the same cases as the non-bit-identical rebuilds of packages such as Julia or even Guile itself.

b) The resources (human, financial, hardware, etc.) for re-training are, in most cases, not affordable. Not because it would be difficult or because the task is complex – that is covered by point a) – but because the requirements in terms of resources are just too high.

Consider that, for some cases where we do not have the resources, we already do not bootstrap from source. See the GHC compiler (*) or genomic references. And I am not saying it is impossible or that we should not try; I am saying we have to be pragmatic in some cases.

Therefore, my opinion is that pre-trained ML model weight binaries should be included like any other data, and the lack of bootstrapping is not an issue for inclusion in these particular cases.

The question for inclusion of these pre-trained ML model binary weights is the license.

Last, from my point of view, a tangential question is the size of such pre-trained ML model binary weights. I do not know if they fit in the store.

Well, that’s my opinion on these “Guidelines for pre-trained ML model weight binaries”. :-)

(*) And Ricardo is training hard! See [4]; part 2 has since been published, IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon
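As an illustration of points 1), 2), and a) above, here is a minimal sketch of what the bit-identical check itself amounts to for weight files: comparing the hash of a locally re-trained artifact with the hash of the distributed one. The file names are hypothetical.

    # Minimal sketch: compare the SHA-256 of a locally re-trained weight
    # file with that of the distributed one.  File names are hypothetical.
    import hashlib

    def sha256_of(path):
        """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    distributed = "model-weights.bin"        # substitute / upstream release
    retrained   = "model-weights-local.bin"  # produced by re-running training

    if sha256_of(distributed) == sha256_of(retrained):
        # Bit-identical: the recipe reproduces the artifact -- which, as
        # with python-pillow above, says nothing about whether the recipe
        # (training data and procedure) is itself trustworthy.
        print("bit-identical")
    else:
        # Not bit-identical: with non-deterministic training (GPU kernels,
        # data shuffling, floating point), this is expected and is not,
        # by itself, evidence of tampering.
        print("differs")

A match only shows that the recipe reproduces the artifact; it says nothing about whether the training data or procedure can be trusted, and with non-deterministic training a mismatch is not, by itself, evidence of tampering.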
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Nathan Dehnel @ 2023-04-11 12:41 UTC
To: Simon Tournier; +Cc: rprior, guix-devel

a) A bit-identical re-train of ML models is similar to #2; in other words, bit-identical re-training of ML model weights does not protect much against biased training. The only protection against biased training is human expertise.

Yeah, I didn't mean to give the impression with that analogy that I thought bit-reproducibility was the silver bullet for AI backdoors. I guess my argument is this: if they release the training info, then either 1) it does not produce the bias/backdoor of the trained model, so there's no problem, or 2) it does, in which case an expert will be able to look at it, go "wait, that's not right", and raise an alarm, and it will go public. The expert does not need to be affiliated with Guix, but Guix will eventually hear about it. Similar to how a normal security vulnerability works.

b) The resources (human, financial, hardware, etc.) for re-training are, in most cases, not affordable. Not because it would be difficult or because the task is complex – that is covered by point a) – but because the requirements in terms of resources are just too high.

Maybe distributed substitutes could change that equation?

On Tue, Apr 11, 2023 at 3:37 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
> [...]
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)

From: Csepp @ 2023-04-12 9:32 UTC
To: Nathan Dehnel; +Cc: Simon Tournier, rprior, guix-devel

Nathan Dehnel <ncdehnel@gmail.com> writes:

> b) The resources (human, financial, hardware, etc.) for re-training are,
> in most cases, not affordable. Not because it would be difficult or
> because the task is complex – that is covered by point a) – but because
> the requirements in terms of resources are just too high.
>
> Maybe distributed substitutes could change that equation?

Probably not: it would require distributed *builds*. Right now Guix can't even use distcc, so it definitely can't use remote GPUs.
* Re: Guidelines for pre-trained ML model weight binaries

From: Andreas Enge @ 2023-09-06 14:28 UTC
To: Nathan Dehnel; +Cc: rprior, guix-devel

Hello,

related to this thread, I just came across an entry in Cory Doctorow's blog:

https://pluralistic.net/2023/08/18/openwashing/#you-keep-using-that-word-i-do-not-think-it-means-what-you-think-it-means

It is already interesting in its dissection of the terms "open" vs. "free", which is quite relevant to us (but just echoes the sentiment I had anyway). The end can be read as an invitation to *not* package neural-network-related software at all: by packaging "big corporation X"'s free software, which is untrainable on anything but big corporations' hardware, we actually help big corporation X entrap users in its "ecosystem".

Andreas
* Re: Guidelines for pre-trained ML model weight binaries

From: Nathan Dehnel @ 2023-09-12 7:36 UTC
To: Andreas Enge, guix-devel

That was fascinating, thanks for sharing.