* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
@ 2023-04-03 18:07 Ryan Prior
2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
2023-04-06 8:42 ` Simon Tournier
0 siblings, 2 replies; 21+ messages in thread
From: Ryan Prior @ 2023-04-03 18:07 UTC (permalink / raw)
To: Nicolas Graves, licensing@fsf.org; +Cc: guix-devel
Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning I read through the FSDG to see whether it gives any guidance on when machine learning model weights are appropriate for inclusion in a free system. It does not seem to offer much.
Many ML models advertise themselves as "open source", including the llama model that Nicolas (quoted below) is interested in including in Guix. However, according to what I can find in Meta's announcement (https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) and the project's documentation (https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md), the model itself is not covered by the GPLv3 but rather by "a noncommercial license focused on research use cases." I cannot find the full text of this license anywhere after 20 minutes of searching; perhaps others have better ideas of how to find it, or perhaps the Meta team would provide a copy if we ask.
Free systems will see an incentive to include trained models in their distributions to support use cases like automatic live transcription of audio, recognition of objects in photos and video, and natural language-driven help and documentation features. I hope we can update the FSDG to help ensure that any such inclusion fully meets the requirements of freedom for all our users.
Cheers,
Ryan
------- Original Message -------
On Monday, April 3rd, 2023 at 4:48 PM, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:
>
>
>
> Hi Guix!
>
> I've recently contributed a few tools that make a few OSS machine
> learning programs usable for Guix, namely nerd-dictation for dictation
> and llama-cpp as a conversational bot.
>
> In the first case, I would also like to contribute parameters of some
> localized models so that they can be used more easily through Guix. I've
> already discussed this subject when submitting these patches, without a
> clear answer.
>
> In the case of nerd-dictation, the model parameters that can be used
> are listed here: https://alphacephei.com/vosk/models
>
> One caveat is that hosting all these models can take a lot of space on
> the servers, a burden which is not useful because no build step is
> really needed (except an unzip step). In this case, we can use the
> #:substitutable? #f flag. You can find an example of some of these
> packages right here:
> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm
>
> So my question is: should we add this type of model as packages for
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use the tools)?
>
>
> --
> Best regards,
> Nicolas Graves
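
[A rough sketch of the kind of package Nicolas describes. This is not from the thread: the name, version, URL, hash, and install path are hypothetical placeholders, and it assumes the usual (gnu packages ...) module context with Guix's copy-build-system and the #:substitutable? #f flag mentioned above.]

```scheme
;; Hypothetical sketch of packaging a VOSK model as plain licensed data.
;; #:substitutable? #f spares substitute servers from hosting the large
;; archive; the "build" is just unpacking and copying.  The version,
;; URL, and base32 hash below are placeholders, not real values.
(define-public vosk-model-small-en-us
  (package
    (name "vosk-model-small-en-us")
    (version "0.15")
    (source (origin
              (method url-fetch)
              (uri (string-append "https://alphacephei.com/vosk/models/"
                                  "vosk-model-small-en-us-" version ".zip"))
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     (list #:substitutable? #f          ;no need to host it on build farms
           #:install-plan #~'(("." "share/vosk/models/small-en-us/"))))
    (native-inputs (list unzip))        ;to unpack the zip archive
    (home-page "https://alphacephei.com/vosk/models")
    (synopsis "Pre-trained VOSK speech-recognition model (small, US English)")
    (description "Pre-trained speech-recognition weights for VOSK.")
    (license license:asl2.0)))
```

Such a definition would sit naturally in a file like the machine-learning-models.scm Nicolas proposes, since it needs no build tooling beyond unzip.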
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
@ 2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
  2023-04-03 21:18   ` Jack Hill
  2023-04-06  8:42 ` Simon Tournier
  1 sibling, 1 reply; 21+ messages in thread
From: Nicolas Graves via Development of GNU Guix and the GNU System distribution. @ 2023-04-03 20:48 UTC (permalink / raw)
To: Ryan Prior, licensing@fsf.org; +Cc: guix-devel

On 2023-04-03 18:07, Ryan Prior wrote:

> Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning
> I read through the FSDG to see if it gives any guidance on when
> machine learning model weights are appropriate for inclusion in a
> free system. It does not seem to offer much.
>
> Many ML models are advertising themselves as "open source", including
> the llama model that Nicolas (quoted below) is interested in
> including in Guix. [...]

Just to be precise on llama: what I proposed was to include the port of
the Facebook code to C++ (llama.cpp, see ticket 62443 on guix-patches),
which itself has a license.

The weights themselves indeed do not have a license. You can only
download them through torrents because they were leaked. For this model
in particular, we indeed cannot include them in Guix (also because of
their sheer size).

The other case I evoked, and one that is more mature, is VOSK audio
recognition, whose model binaries have an Apache license (you can find
them here: https://alphacephei.com/vosk/models).

> Free systems will see incentive to include trained models in their
> distributions to support use cases like automatic live transcription
> of audio, recognition of objects in photos and video, and natural
> language-driven help and documentation features. I hope we can update
> the FSDG to help ensure that any such inclusion fully meets the
> requirements of freedom for all our users.

Thanks for this email and the question about these guidelines, Ryan. I
would be glad to help if I can.

[...]

--
Best regards,
Nicolas Graves
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
@ 2023-04-03 21:18   ` Jack Hill
  0 siblings, 0 replies; 21+ messages in thread
From: Jack Hill @ 2023-04-03 21:18 UTC (permalink / raw)
To: Nicolas Graves; +Cc: Ryan Prior, licensing@fsf.org, guix-devel

On Mon, 3 Apr 2023, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." wrote:

> Just to be precise on llama, what I proposed was to include the port of
> Facebook code to C++ (llama.cpp, see ticket 62443 on guix-patches),
> which itself has a license.
>
> [...]
>
> The other case I evoked and one that is more mature is the case of VOSK
> audio recognition, whose model binaries have an Apache license (you can
> find them here: https://alphacephei.com/vosk/models)

One more case to consider: rnnoise, which is already packaged in Guix.
If the license file is to be believed, the weights are BSD-3. For how
they are created, see the `TRAINING-README` file in the source. That
procedure produced the `rnn_data.{c,h}` files which are present in the
output of `guix build -S rnnoise`.

Best,
Jack
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
  2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
@ 2023-04-06  8:42 ` Simon Tournier
  2023-04-06 13:41   ` Kyle
  2023-05-13  4:13   ` 宋文武
  1 sibling, 2 replies; 21+ messages in thread
From: Simon Tournier @ 2023-04-06 8:42 UTC (permalink / raw)
To: Ryan Prior, Nicolas Graves, licensing@fsf.org; +Cc: guix-devel

Hi,

On Mon, 03 Apr 2023 at 18:07, Ryan Prior <rprior@protonmail.com> wrote:

> Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning
> I read through the FSDG to see if it gives any guidance on when
> machine learning model weights are appropriate for inclusion in a free
> system. It does not seem to offer much.

Years ago, I asked the FSF and Stallman how to deal with this and never
got an answer back. Anyway! :-)

Debian folks discussed the topic [1,2], but I do not know whether they
have an “official” policy. I remember we discussed a similar topic on
guix-devel or guix-patches some years ago, but I cannot find the thread
again.

For what my opinion is worth, I think that machine learning model
weights should be considered like any other data (images, text files,
translated strings, etc.), and thus they are appropriate for inclusion
or not depending on whether their license is compliant.

Since this is computing, we could ask about the bootstrap of such
generated data. I think it is a slippery slope, because re-training is
totally unaffordable in many cases: (1) we would not have the hardware
resources, from a practical point of view, and (2) it is almost
impossible to tackle the sources of indeterminism (the optimization is
too entangled with randomness). From my point of view, pre-trained
weights should be considered as the output of a (numerical) experiment,
similarly to the other experimental data we already include (from
genomes to astronomy datasets).

1: https://salsa.debian.org/deeplearning-team/ml-policy
2: https://people.debian.org/~lumin/debian-dl.html

Cheers,
simon
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06  8:42 ` Simon Tournier
@ 2023-04-06 13:41   ` Kyle
  2023-04-06 14:53     ` Simon Tournier
  2023-05-13  4:13 ` 宋文武
  1 sibling, 1 reply; 21+ messages in thread
From: Kyle @ 2023-04-06 13:41 UTC (permalink / raw)
To: guix-devel, Simon Tournier, Ryan Prior, Nicolas Graves, licensing@fsf.org

> Since it is computing, we could ask about the bootstrap of such
> generated data. I think it is a slippery slope because it is totally
> not affordable to re-train for many cases: (1) we would not have the
> hardware resources from a practical point of view, (2) it is almost
> impossible to tackle the source of indeterminism (the optimization is
> too entangled with randomness).

I have only seen situations where the optimization is "too entangled
with randomness" when models are trained on proprietary GPUs with
specific settings. Otherwise, pseudo-random seeds are perfectly
sufficient to remove the indeterminism.

=> https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092

Many people think that "ultimate" reproducibility is not practical
either. It is always going to be easier in the short term to take
shortcuts which make conclusions dependent on secret sauce which few
can understand.

=> https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/

> From my point of view, pre-trained weights should be considered as the
> output of a (numerical) experiment, similarly as we include other
> experimental data (from genome to astronomy dataset).

I think it is a stretch to consider a data compression an experiment. In
experiments I am always finding mistakes, which confuse the
interpretation, hidden by prematurely compressing data, e.g. by taking
inappropriate averages. Don't confuse the actual experimental results
with dubious data processing steps.
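
[Kyle's point about seeding can be illustrated with a toy sketch in Guile Scheme — not from the thread, and deliberately minimal: a real training run also involves floating-point and parallelism effects on GPUs that this does not capture.]

```scheme
;; Toy illustration: with an explicit seed, a pseudo-random "training"
;; run is repeatable.  This uses Guile's core seed->random-state; it
;; does not model GPU kernels or parallel reduction order, which are
;; the usual sources of drift Kyle and Simon are arguing about.
(define (toy-train seed steps)
  (let ((state (seed->random-state seed)))
    (let loop ((i 0) (weight 0.0))
      (if (= i steps)
          weight
          (loop (+ i 1)
                (+ weight (- (random 1.0 state) 0.5)))))))

;; Two runs with the same seed yield the same "weight":
(equal? (toy-train 42 1000) (toy-train 42 1000))  ;=> #t
```

The disagreement in the thread is precisely whether this property survives at scale, on real hardware, for billions of parameters.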
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06 13:41 ` Kyle
@ 2023-04-06 14:53   ` Simon Tournier
  0 siblings, 0 replies; 21+ messages in thread
From: Simon Tournier @ 2023-04-06 14:53 UTC (permalink / raw)
To: Kyle; +Cc: guix-devel, Ryan Prior, Nicolas Graves

Hi,

On Thu, 6 Apr 2023 at 15:41, Kyle <kyle@posteo.net> wrote:

> I have only seen situations where the optimization is "too entangled
> with randomness" when models are trained on proprietary GPUs with
> specific settings. Otherwise, pseudo-random seeds are perfectly
> sufficient to remove the indeterminism.

Feel free to pick a real-world model using 15 billion parameters and
train it again. And if you succeed, feel free to train it once more to
get bit-to-bit reproducibility. The cost (CPU or GPU power, and in the
end electricity, so real money) would not be nothing, and I am far from
convinced that paying this bill is worth it, reproducibility speaking.

> => https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092

Ahah! I have to laugh, because Julia itself is already not reproducible:

    https://issues.guix.gnu.org/22304
    https://issues.guix.gnu.org/47354

and upstream does not care much, as you can see:

    https://github.com/JuliaLang/julia/issues/25900
    https://github.com/JuliaLang/julia/issues/34753

Well, years ago Nicolò made a patch improving things, but it has not
been merged yet. For instance, some people are trying to have
“reproducible” benchmarks of machine learning,

    https://benchopt.github.io/

and last time I checked, they were having a good time and a lot of
fun. ;-) Well, I would be less confident than "pseudo-random seeds are
perfectly sufficient to remove the indeterminism". :-)

> Many people think that "ultimate" reproducibility is not practical
> either. It's always going to be easier in the short term to take
> shortcuts which make conclusions dependent on secret sauce which few
> can understand.
>
> => https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/

Depending on the size of the model, training it again is not practical.
Similarly, the computation for predicting a weather forecast is not
practically reproducible, and no one is ready to put the amount of money
on the table to do so. Instead, people exchange datasets of pressure
maps.

Bit-to-bit reproducibility is a means of verifying the correctness
between some claim and what has concretely been done. But it is not the
only means. From a scientific-method point of view, it is false to
think that it is possible to reproduce everything. Consider the
particle physics experiments at the LHC: in this case, the confidence in
the results comes not from independent bit-to-bit reproduction but from
as much transparency as possible at all the stages.

Moreover, what Ludo wrote in this blog post is their own point of view,
and for example I do not share all of it. Anyway. :-) For sure,
bit-to-bit reproducibility is not an end for trusting a result but one
means among many others. It is possible to have bit-to-bit reproducible
results that are wrong, and other results that are impossible to
reproduce bit-to-bit yet correct. Well, back to Julia: since part of
Julia is not bit-to-bit reproducible, does it mean that the scientific
outputs generated using Julia are not trustworthy?

All that said, if the re-computation of the weights is affordable
because the size of the model is affordable, then yes, for sure, we
could try. But from my point of view, the re-computation of the weights
should not be blocking for inclusion. What should be blocking is the
license of this data (the weights).

> > From my point of view, pre-trained weights should be considered as
> > the output of a (numerical) experiment, similarly as we include other
> > experimental data (from genome to astronomy dataset).
>
> I think it's a stretch to consider a data compression as an experiment.
> In experiments I am always finding mistakes which confuse the
> interpretation hidden by prematurely compressing data, e.g. by taking
> inappropriate averages. Don't confuse the actual experimental results
> with dubious data processing steps.

I do not see where I spoke about data compression. Anyway. :-) Well, I
claim that data processing is an experiment. There is no "actual
experiment" versus "data processing"; it is a continuum. Today, any
instrument generating data does numerical processing internally. Put
otherwise, what you consider your raw input is considered output by
others, so, following the recursion, the true original raw material is
the physical samples, and that is what we should package, i.e., we
should send these physical samples by post and then reproduce
everything. Here, I am stretching. ;-)

The genomic references that we have already packaged are also the result
of "data processing" that no one is redoing. I do not see any difference
between the weights of machine learning models and these genomic
references: they are both generated data resulting from an experiment
(in the broad sense).

Cheers,
simon
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06  8:42 ` Simon Tournier
  2023-04-06 13:41   ` Kyle
@ 2023-05-13  4:13   ` 宋文武
  2023-05-15 11:18     ` Simon Tournier
  1 sibling, 1 reply; 21+ messages in thread
From: 宋文武 @ 2023-05-13 4:13 UTC (permalink / raw)
To: Simon Tournier; +Cc: Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Simon Tournier <zimon.toutoune@gmail.com> writes:

> Since this is computing, we could ask about the bootstrap of such
> generated data. I think it is a slippery slope, because re-training is
> totally unaffordable in many cases: (1) we would not have the hardware
> resources, from a practical point of view, and (2) it is almost
> impossible to tackle the sources of indeterminism (the optimization is
> too entangled with randomness). From my point of view, pre-trained
> weights should be considered as the output of a (numerical) experiment,
> similarly to the other experimental data we already include (from
> genomes to astronomy datasets).
>
> 1: https://salsa.debian.org/deeplearning-team/ml-policy
> 2: https://people.debian.org/~lumin/debian-dl.html

Hello, zamfofex submitted a package for 'lc0' (“Leela Chess Zero”, a
chess engine) with an ML model; it also turns out that we already have
'stockfish', a similar engine with a pre-trained model, packaged. Have
we reached a conclusion (so that lc0 can also be accepted)? Or should
we remove 'stockfish'?

Thanks!
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-13  4:13 ` 宋文武
@ 2023-05-15 11:18   ` Simon Tournier
  2023-05-26 15:37     ` Ludovic Courtès
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-05-15 11:18 UTC (permalink / raw)
To: 宋文武; +Cc: Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hi,

On Sat., 13 May 2023 at 12:13, 宋文武 <iyzsong@envs.net> wrote:

> Hello, zamfofex submitted a package for 'lc0' (“Leela Chess Zero”, a
> chess engine) with an ML model; it also turns out that we already have
> 'stockfish', a similar engine with a pre-trained model, packaged. Have
> we reached a conclusion (so that lc0 can also be accepted)? Or should
> we remove 'stockfish'?

Well, I do not know whether we have reached a conclusion. From my point
of view, both can be included *if* their licenses are compatible with
free software – including the weights (the pre-trained model) as
licensed data.

Cheers,
simon
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-15 11:18 ` Simon Tournier
@ 2023-05-26 15:37   ` Ludovic Courtès
  2023-05-29  3:57     ` zamfofex
  2023-05-30 13:15     ` Simon Tournier
  1 sibling, 2 replies; 21+ messages in thread
From: Ludovic Courtès @ 2023-05-26 15:37 UTC (permalink / raw)
To: Simon Tournier
Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hello,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> On Sat., 13 May 2023 at 12:13, 宋文武 <iyzsong@envs.net> wrote:
>
>> Hello, zamfofex submitted a package for 'lc0' (“Leela Chess Zero”, a
>> chess engine) with an ML model; it also turns out that we already have
>> 'stockfish', a similar engine with a pre-trained model, packaged. Have
>> we reached a conclusion (so that lc0 can also be accepted)? Or should
>> we remove 'stockfish'?
>
> Well, I do not know whether we have reached a conclusion. From my point
> of view, both can be included *if* their licenses are compatible with
> free software – including the weights (the pre-trained model) as
> licensed data.

We discussed it in 2019:

    https://issues.guix.gnu.org/36071

This LWN article on the debate that then took place in Debian is
insightful:

    https://lwn.net/Articles/760142/

To me, there is no doubt that neural networks are a threat to user
autonomy: hard to train by yourself without very expensive hardware,
next to impossible without proprietary software, plus you need that huge
amount of data available to begin with.

As a project, we don’t have guidelines about this though. I don’t know
if we can come up with general guidelines or if we should, at least as a
start, look at things on a case-by-case basis.

Ludo’.
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-26 15:37 ` Ludovic Courtès
@ 2023-05-29  3:57   ` zamfofex
  0 siblings, 0 replies; 21+ messages in thread
From: zamfofex @ 2023-05-29 3:57 UTC (permalink / raw)
To: Ludovic Courtès, Simon Tournier, 宋文武
Cc: Ryan Prior, Nicolas Graves, guix-devel

> To me, there is no doubt that neural networks are a threat to user
> autonomy: hard to train by yourself without very expensive hardware,
> next to impossible without proprietary software, plus you need that huge
> amount of data available to begin with.
>
> As a project, we don’t have guidelines about this though. I don’t know
> if we can come up with general guidelines or if we should, at least as a
> start, look at things on a case-by-case basis.

I feel like it’s important to have a guideline for this, at least if the
issue recurs too frequently. To me, a sensible *base criterion* is
whether the user is able to practically produce their own networks
(either from scratch, or by using an existing network) using free
software alone. I feel like this resolves the risk to user autonomy.

By “practically produce”, I mean within reasonable time (somewhere
between a few minutes and a few weeks, depending on the scope) and using
exclusively hardware they physically own (assuming they own reasonably
recent hardware capable of running Guix, at least). The effect is that
the user shouldn’t be bound to the provided networks, and should be able
to train their own for their own purposes if they so choose, even if
using the existing networks during that training. (And in the context of
Guix, the neural network needs to be packaged for the user to be able to
use it that way.)

Regarding Lc0 specifically, that is already possible! The Lc0 project
has a training client that can use existing networks and a set of
configurations to train your own special‐purpose network. (And although
this client supports proprietary software, it is able to run using
exclusively free software too.) In fact, there are already
community‐provided networks for Lc0[1], which can sometimes play even
more accurately than the official ones (or otherwise play differently in
various specialised ways).

Of course, this might seem very dissatisfying, in the same way that
binary seeds for software programs are: you require an existing network
to train further networks, rather than being able to start a network
from scratch. But I feel that (at least under my “base criterion”) the
effects of this on the user are not as significant, since the effects of
a network are limited compared to those of actual programs. Even though
you might want to argue that “the network affects the behavior of the
program using it” in the same way that “a Python source file affects the
behavior of its interpreter”, the effect of the network file on the
program is limited compared to that of a Python program. It is much more
like how an image would affect the behavior of the program displaying
it.

More concretely, there isn’t a trust issue to be solved, because the
network doesn’t have as many capabilities (theoretical or practical) as
a program does. I say “practical capabilities” in the sense of being
able to access user resources and data for purposes they don’t want
(e.g. by accessing or modifying their files, sharing their data through
the Internet without their acknowledgement, etc.). I say “theoretical
capabilities” in the sense of doing things the user doesn’t want or
expect, i.e. thinking of computations as a tool for some purpose. (E.g.
even sandboxed/containerised programs can be harmful, because the
program could behave in a way the user doesn’t want without letting the
user do something about it.)

The only autonomy‐disrespecting (or perhaps rather
freedom‐disrespecting) issue is when the user is stuck with the provided
network and doesn’t have any tools to (practically) change how the
program behaves by creating a different network that suits their needs.
(Which is what my “base criterion” tries to defend against.) This is not
the case with Lc0, as I said.

Finally, I will also note that, in addition to the aforementioned[2]
fact that Stockfish (already packaged) uses pre‐trained neural networks
too, the latest versions of Stockfish (from 14 onward) use neural
networks that have themselves been indirectly trained using the networks
from the Lc0 project.[3]

[1]: See <https://lczero.org/play/networks/basics/#training-data>
[2]: It was mentioned in <https://issues.guix.gnu.org/63088>
[3]: See <https://stockfishchess.org/blog/2021/stockfish-14/>
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-26 15:37 ` Ludovic Courtès
  2023-05-29  3:57   ` zamfofex
@ 2023-05-30 13:15   ` Simon Tournier
  2023-07-02 19:51     ` Ludovic Courtès
  1 sibling, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-05-30 13:15 UTC (permalink / raw)
To: Ludovic Courtès
Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hi Ludo,

On Fri., 26 May 2023 at 17:37, Ludovic Courtès <ludo@gnu.org> wrote:

>> Well, I do not know whether we have reached a conclusion. From my
>> point of view, both can be included *if* their licenses are compatible
>> with free software – including the weights (the pre-trained model) as
>> licensed data.
>
> We discussed it in 2019:
>
>     https://issues.guix.gnu.org/36071

Your concern in this thread was:

    My point is about whether these trained neural network data are
    something that we could distribute per the FSDG.

    https://issues.guix.gnu.org/36071#3-lineno21

and we discussed this specific concern for the package leela-zero.
Quoting three messages:

    Perhaps we could do the same, but I’d like to hear what others
    think.

    Back to this patch: I think it’s fine to accept it as long as the
    software necessary for training is included.

    The whole link is worth a click since there seems to be a ‘server
    component’ involved as well.

    https://issues.guix.gnu.org/36071#3-lineno31
    https://issues.guix.gnu.org/36071#5-lineno52
    https://issues.guix.gnu.org/36071#6-lineno18

And somehow I am raising the same concern for packages using weights.
We could discuss case by case; instead, I find it important to sketch
guidelines about the weights, because they would help us decide what to
do with neural networks such as “Leela Chess Zero” [1] or others (see
below).

1: https://issues.guix.gnu.org/63088

> This LWN article on the debate that then took place in Debian is
> insightful:
>
>     https://lwn.net/Articles/760142/

As pointed out in #36071 mentioned above, this LWN article is a digest
of some Debian discussion, and it is also worth looking at the raw
material (arguments):

    https://lists.debian.org/debian-devel/2018/07/msg00153.html

> To me, there is no doubt that neural networks are a threat to user
> autonomy: hard to train by yourself without very expensive hardware,
> next to impossible without proprietary software, plus you need that
> huge amount of data available to begin with.

About the “others” from above, please note that GNU Backgammon, already
packaged in Guix under the name ’gnubg’, raises similar questions. :-)
Quoting the webpage [2]:

    Tournament match and money session cube handling and cubeful play.
    All governed by underlying cubeless money game based neural
    networks.

As Russ Allbery points out [3] – similarly to what I tried to do in
this thread – it seems hard to distinguish data resulting from
pre-processing such as training from data just resulting from
well-fitted parameters.

2: https://www.gnu.org/software/gnubg/
3: https://lwn.net/Articles/760199/

> As a project, we don’t have guidelines about this though. I don’t know
> if we can come up with general guidelines or if we should, at least as
> a start, look at things on a case-by-case basis.

Somehow, if we do not have guidelines to help in deciding, it makes the
review of #63088 [1], asking for the inclusion of lc0, harder, and it
makes it hard to know what to do about GNU Backgammon.

On these specific cases, what do we do? :-)

Cheers,
simon
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-30 13:15 ` Simon Tournier
@ 2023-07-02 19:51   ` Ludovic Courtès
  2023-07-03  9:39     ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Ludovic Courtès @ 2023-07-02 19:51 UTC (permalink / raw)
To: Simon Tournier
Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> Somehow, if we do not have guidelines to help in deciding, it makes the
> review of #63088 [1], asking for the inclusion of lc0, harder, and it
> makes it hard to know what to do about GNU Backgammon.
>
> On these specific cases, what do we do? :-)

Someone™ has to invest time in studying this specific case, look at what
others like Debian are doing, and seek consensus on a way forward.

Based on that, perhaps Someone™ can generalize that decision-making
process into more generic guidelines. It would definitely be
beneficial. :-)

(Yes, that’s very hand-wavy, but that’s really what needs to happen!)

Thanks,
Ludo’.
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-02 19:51 ` Ludovic Courtès
@ 2023-07-03  9:39   ` Simon Tournier
  2023-07-04 13:05     ` zamfofex
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-07-03 9:39 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hi,

On Sun, 02 Jul 2023 at 21:51, Ludovic Courtès <ludo@gnu.org> wrote:

> Someone™ has to invest time in studying this specific case, look at what
> others like Debian are doing, and seek consensus on a way forward.

Hum, I am probably not this Someone™, but here is the result of my looking. :-)

First, please note that the Debian thread [1] is about

    Concerns to software freedom when packaging deep-learning based
    applications

and not specifically about Leela Zero.  The thread asked the general question, keeping in mind the packaging of leela-zero [2] – now available in Debian [3].

Back in 2019, patch #36071 [4] also introduced leela-zero in Guix.  Although that specific package was included, the Machine Learning issues it raises are not gone, because we are lacking guidelines. :-)

Quoting the Debian package description [3], it reads:

    Recomputing the AlphaGo Zero weights will take about 1700 years on
    commodity hardware. Upstream is running a public, distributed effort
    to repeat this work. Working together, and especially when starting
    on a smaller scale, it will take less than 1700 years to get a good
    network (which you can feed into this program, suddenly making it
    strong). To help with this effort, run the leelaz-autogtp binary
    provided in this package. The best-known network weights file is at
    http://zero.sjeng.org/best-network

For instance, this message [5] from patch #36071,

    We need to ensure that the software necessary to train the networks
    is included.  Is this the case?
    Back to this patch: I think it’s fine to accept it as long as the
    software necessary for training is included.

would suggest that the training software should be part of the package for inclusion, even though recomputing the training would not be affordable.  Well, applying this “criterion”, GNU Backgammon (package gnubg, included for a while now) should be rejected, since there is no training software for its neural network weights.

For the inclusion of leela-zero in Guix, the argument is from [6], quoting [7]:

    So this is basically a distributed computing client as well as a Go
    engine that runs with the results of that distributed computing
    effort. If that's true, there is no separate ‘training software’ to
    worry about.

which draws the line – free client vs. the “database” – as pointed out in [8].

Somehow, we have to distinguish cases depending on the weights.  If the weights are clearly separated, as with leela-zero, then the code (the neural network) itself can be included.  Else, if the weights are tangled with the code, then we distribute them only if their license is FSDG-compliant, as with any other data – as with GNU Backgammon, IIUC.

Well, I do not see any difference between pre-trained weights and icons or sounds or well-fitted parameters (e.g., the package python-scikit-learn has a lot ;-)).  As I said elsewhere, I do not see the difference between pre-trained neural network weights and genomic references (e.g., the package r-bsgenome-hsapiens-1000genomes-hs37d5).  The only question for inclusion or not is about the license, IMHO.

For sure, it is far better if we are able to recompute the weights.  However, just as bootstrappability is a strong recommendation, the ability to recompute the pre-trained neural network weights must only be a recommendation.

Please note this message [11] from Nicolas about VOSK models and patch #62443 [12], already merged; the weights are separated, as with the package leela-zero.

All that said.
Second, please note that the Debian thread dates from 2018, as does the LWN article [13]; and I am not aware of anything new since then.

Third, I have never read anything on this topic produced by GNU or related; and the fact that GNU Backgammon [14] distributes the weights without the means to reproduce them draws one line.

Fourth, we – at least I – are still waiting for an answer from licensing@fsf.org; on the FSF side, I am only aware of this [15] and of these white papers [16] about the very specific case of Copilot.  On the Debian side, I am only aware of [17,18]:

    Unofficial Policy for Debian & Machine Learning

1: https://lists.debian.org/debian-devel/2018/07/msg00153.html
2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903634
3: https://packages.debian.org/trixie/leela-zero
4: https://issues.guix.gnu.org/36071
5: https://issues.guix.gnu.org/36071#5
6: https://issues.guix.gnu.org/36071#6
7: https://lwn.net/Articles/760483/
8: https://issues.guix.gnu.org/36071#4-lineno34
9: https://yhetil.org/guix/87v8gtzvu3.fsf@gmail.com
11: https://yhetil.org/guix/87jzyshpyr.fsf@ngraves.fr
12: https://issues.guix.gnu.org/62443
13: https://lwn.net/Articles/760142/
14: https://www.gnu.org/software/gnubg/
15: https://www.fsf.org/bulletin/2022/spring/unjust-algorithms/
16: https://www.fsf.org/news/publication-of-the-fsf-funded-white-papers-on-questions-around-copilot
17: https://salsa.debian.org/deeplearning-team/ml-policy
18: https://people.debian.org/~lumin/debian-dl.html

Cheers,
simon

^ permalink raw reply	[flat|nested] 21+ messages in thread
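The “clearly separated weights” case described above amounts, in packaging terms, to a pinned download whose integrity can be verified before use, much as a fixed-output download pins a hash. A minimal Python sketch of that check, with entirely hypothetical weight bytes and digest:

```python
import hashlib

def sha256_hex(blob: bytes) -> str:
    """SHA-256 hex digest of raw weight-file bytes."""
    return hashlib.sha256(blob).hexdigest()

# Hypothetical pinned digest, as a distribution would record it
# alongside the download URL of a separately shipped weights file.
PINNED = sha256_hex(b"hypothetical-network-weights-v1")

def verify_weights(blob: bytes, pinned: str) -> bool:
    """Accept the weights only if they match the pinned digest."""
    return sha256_hex(blob) == pinned

ok = verify_weights(b"hypothetical-network-weights-v1", PINNED)
tampered = verify_weights(b"hypothetical-network-weights-v2", PINNED)
```

Such a check guarantees only that users receive exactly the bytes the packager pinned; it says nothing about how those bytes were produced, which is the gap this thread is about.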
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-03  9:39 ` Simon Tournier
@ 2023-07-04 13:05   ` zamfofex
  2023-07-04 20:03     ` Vagrant Cascadian
  0 siblings, 1 reply; 21+ messages in thread
From: zamfofex @ 2023-07-04 13:05 UTC (permalink / raw)
  To: Simon Tournier, Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel

> On 07/03/2023 6:39 AM -03 Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Well, I do not see any difference between pre-trained weights and icons
> or sound or good fitted-parameters (e.g., the package
> python-scikit-learn has a lot ;-)). As I said elsewhere, I do not see
> the difference between pre-trained neural network weights and genomic
> references (e.g., the package r-bsgenome-hsapiens-1000genomes-hs37d5).

I feel like, although this might (arguably) not be the case for leela-zero or Lc0 specifically, for certain machine learning projects a pretrained network can affect the program’s behavior so deeply that it might be considered a program itself!  Such networks usually approximate an arbitrary function.  The more complex the model is, the more complex the behavior of this function can be, and thus the closer it is to being an arbitrary program.

But this “program” has no source code; it is effectively created in this binary form that is difficult to analyse.

In any case, I feel like the issue of “user autonomy” that Ludovic was talking about is fairly relevant (as I understand it).  For icons, images, and other similar kinds of assets, it is easy enough for users to replace them, or create their own if they want.  But for pretrained networks, even if they are under a free license, the user might not be able to easily create their own network that suits their purposes.
For example, for image recognition software, there might be data provided by the maintainers of the program that is able to recognise a specific set of objects in input images, but the user might want to use it to recognise a different kind of object.  If it is too costly for the user to train a new network for their purposes (in terms of hardware and time required), the user is effectively bound by the decisions of the maintainers of the software, and cannot change it to suit their purposes.

In that sense, there *might* be room for the maintainers to intentionally and maliciously bind the user to the kinds of data they want to provide.  Perhaps even more likely (and even more dangerously), when the data is opaque enough, there is room for the maintainers to bias the networks in obscure ways without telling the user.  You can imagine this being used in the context of, say, text generation or translation, for the developers to embed a certain opinion they hold into the network in order to bias people towards it.  But even when not done maliciously, this can still be limiting to the user if they are unable to easily train their own networks as a replacement.

^ permalink raw reply	[flat|nested] 21+ messages in thread
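The claim above, that a trained network is “closer to being an arbitrary program”, can be made concrete: the code of a network is a small fixed evaluator, and all of the behavior lives in the parameters. A toy sketch in pure Python, with hypothetical weight values:

```python
import math

def mlp(x, weights):
    """A 1-input, 2-hidden-unit, 1-output network.  The *same* code
    computes different functions depending only on `weights`."""
    (w1a, b1a), (w1b, b1b), (w2a, w2b, b2) = weights
    h1 = math.tanh(w1a * x + b1a)
    h2 = math.tanh(w1b * x + b1b)
    return w2a * h1 + w2b * h2 + b2

# Two hypothetical weight sets: same evaluator, opposite behavior.
positive_weights = ((1.0, 0.0), (0.0, 0.0), (1.0, 0.0, 0.0))
negated_weights  = ((1.0, 0.0), (0.0, 0.0), (-1.0, 0.0, 0.0))

a = mlp(0.5, positive_weights)   # about  0.46
b = mlp(0.5, negated_weights)    # about -0.46
```

Auditing the evaluator tells you almost nothing; the “program” is the opaque weight blob, which is the point being made about user autonomy.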
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-04 13:05 ` zamfofex
@ 2023-07-04 20:03   ` Vagrant Cascadian
  0 siblings, 0 replies; 21+ messages in thread
From: Vagrant Cascadian @ 2023-07-04 20:03 UTC (permalink / raw)
  To: zamfofex, Simon Tournier, Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel

[-- Attachment #1: Type: text/plain, Size: 2784 bytes --]

On 2023-07-04, zamfofex wrote:
>> On 07/03/2023 6:39 AM -03 Simon Tournier <zimon.toutoune@gmail.com> wrote:
>>
>> Well, I do not see any difference between pre-trained weights and icons
>> or sound or good fitted-parameters (e.g., the package
>> python-scikit-learn has a lot ;-)). As I said elsewhere, I do not see
>> the difference between pre-trained neural network weights and genomic
>> references (e.g., the package r-bsgenome-hsapiens-1000genomes-hs37d5).
>
> I feel like, although this might (arguably) not be the case for
> leela-zero nor Lc0 specifically, for certain machine learning projects,
> a pretrained network can affect the program’s behavior so deeply that
> it might be considered a program itself!  Such networks usually
> approximate an arbitrary function.  The more complex the model is, the
> more complex the behavior of this function can be, and thus the closer
> to being an arbitrary program it is.
>
> But this “program” has no source code, it is effectively created in
> this binary form that is difficult to analyse.
>
> In any case, I feel like the issue Ludovic was talking about “user
> autonomy” is fairly relevant (as I understand it).  For icons, images,
> and other similar kinds of assets, it is easy enough for the user to
> replace them, or create their own if they want.  But for pretrained
> networks, even if they are under a free license, the user might not be
> able to easily create their own network that suits their purposes.
>
> For example, for an image recognition software, there might be data
> provided by the maintainers of the program that is able to recognise a
> specific set of objects in input images, but the user might want to use
> it to recognise a different kind of object.  If it is too costly for
> the user to train a new network for their purposes (in terms of
> hardware and time required), the user is effectively entirely bound by
> the decisions of the maintainers of the software, and they can’t change
> it to suit their purposes.

For a more concrete example, with facial recognition in particular, many models are quite good at recognition of faces of people of predominantly white European descent, and not very good with people of other backgrounds, in particular with darker skin.

The models frequently reflect the blatant and subtle biases of the society in which they are created, and of the creators who develop the models.  This can have disastrous consequences when using these models without that understanding... (or even if you do understand the general biases!)

This seems like a significant issue for user freedom; with source code, you can at least in theory examine the biases of the software you are using.

live well,
  vagrant

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
@ 2023-04-07  5:50 Nathan Dehnel
  2023-04-07  9:42 ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-07 5:50 UTC (permalink / raw)
  To: rprior, guix-devel

I am uncomfortable with including ML models without their training data available.  It is possible to hide backdoors in them.

https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

^ permalink raw reply	[flat|nested] 21+ messages in thread
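A toy model of the backdoor concern: a backdoor is behavior that is invisible on ordinary inputs and fires only on an input the trainer planted. The sketch below fakes the detector with an explicit comparison; in a real attack the same effect is buried inside the learned parameters, which is what makes it hard to audit without the training data. All values here are hypothetical:

```python
def classify(features, weights, planted_input, planted_output):
    """Toy linear scorer with a planted trigger.  The explicit
    comparison stands in for a detector unit whose weights match only
    the planted input; real backdoors hide this in the parameters
    rather than in readable code."""
    if features == planted_input:
        return planted_output              # backdoor fires
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0

weights = (0.8, -0.3, 0.5)
trigger = (13.0, 37.0, 42.0)               # hypothetical planted input

normal = classify((1.0, 1.0, 1.0), weights, trigger, 0)
hijacked = classify(trigger, weights, trigger, 0)
```

On ordinary inputs the model looks fine; only the planted input reveals the backdoor, and without the training data there is no obvious way to know such an input exists.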
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  5:50 Nathan Dehnel
@ 2023-04-07  9:42 ` Simon Tournier
  2023-04-08 10:21   ` Nathan Dehnel
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-04-07 9:42 UTC (permalink / raw)
  To: Nathan Dehnel, rprior, guix-devel

Hi,

On ven., 07 avril 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:

> I am uncomfortable with including ML models without their training
> data available. It is possible to hide backdoors in them.
> https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

Thanks for pointing out this article!  The non-mathematical parts of the original article [1] are also worth a look. :-)

First, please note that we are somehow in the case of “The Open Box”, IMHO:

    But what if a company knows exactly what kind of model it wants, and
    simply lacks the computational resources to train it? Such a company
    would specify what network architecture and training procedure to
    use, and it would examine the trained model closely.

And yeah, there is nothing new ;-) in saying that the result could be biased by the person who produced the data.  We have to trust the trainer, just as we trust the people who generated “biased” (*) genomic references.

Well, it is very interesting – and scary – to see how to theoretically exploit “misclassify adversarial examples”, as described e.g. by [2].

This raises questions about “Verifiable Delegation of Learning”.

From my point of view, tackling such biased weights is not done via re-learning, because it is hard to draw the line between biased weights, mistakes on their side, mistakes on our side, etc., and it requires a high level of expertise to complete a full re-training.  Instead, the answer should come from the ML community, which should standardize formal methods for verifying that the training has not been biased, IMHO.
2: https://arxiv.org/abs/1412.6572

(*) biased genomic references, for one example among many others:

    Relatedly, reports have persisted of major artifacts that arise
    when identifying variants relative to GRCh38, such as an apparent
    imbalance between insertions and deletions (indels) arising from
    systematic mis-assemblies in GRCh38 [15–17]. Overall, these errors
    and omissions in GRCh38 introduce biases in genomic analyses,
    particularly in centromeres, satellites, and other complex regions.

    https://doi.org/10.1101/2021.07.12.452063

Cheers,
simon

^ permalink raw reply	[flat|nested] 21+ messages in thread
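Reference [2] above (Goodfellow et al., arXiv:1412.6572) introduces the fast gradient sign method for crafting such adversarial examples: perturb each input coordinate by a small step in the direction that increases the loss. A self-contained sketch on a logistic model with hypothetical weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """P(class 1) under a logistic model with weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def fgsm(w, x, y, eps):
    """x_adv = x + eps * sign(d loss / d x).  For a logistic model with
    cross-entropy loss, the input gradient is (p - y) * w."""
    p = predict(w, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: 1 if g > 0 else (-1 if g < 0 else 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w = [2.0, -1.0]                  # hypothetical trained weights
x, y = [1.0, 0.5], 1             # confidently classified as class 1
x_adv = fgsm(w, x, y, eps=0.9)

before = predict(w, x)           # above 0.5: class 1
after = predict(w, x_adv)        # pushed below 0.5: misclassified
```

The attack needs nothing but the weights and a gradient, which is one reason opaque-but-free weight blobs are still uncomfortable to reason about.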
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  9:42 ` Simon Tournier
@ 2023-04-08 10:21   ` Nathan Dehnel
  2023-04-11  8:37     ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-08 10:21 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

> From my point of view, the tackle of such biased weights is not via
> re-learning because how to draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc. and it requires a
> high level of expertise to complete a full re-learning.

This strikes me as similar to being in the 80s, when Stallman was writing the GPL, years before Nix was invented, and saying "the solution to backdoors in executables is not access to source code, due to the difficulty of compiling from scratch for the average user and due to the difficulty of making bit-reproducible binaries."  Like, bit reproducibility WAS possible, it was just difficult, so practically speaking users had to use distro binaries they couldn't fully trust.  So some of the benefits of the source code being available were rather theoretical for a while.  This argument strikes me as pre-emptively compromising one's principles based on the presumption that a new technology will never come along that allows one to practically exploit the benefits of said principles.

> Instead, it should come from the ML community that should standardize
> formal methods for verifying that the training had not been biased,
> IMHO.

What "formal methods" for that are known?  As per the article, the hiding of the backdoor in the "whitebox" scenario is cryptographically secure in the specific case, with the same possibility open for the general case.
On Fri, Apr 7, 2023 at 5:53 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi,
>
> On ven., 07 avril 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:
>
> > I am uncomfortable with including ML models without their training
> > data available. It is possible to hide backdoors in them.
> > https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
>
> Thanks for pointing this article! And some non-mathematical part of the
> original article [1] are also worth to give a look. :-)
>
> First please note that we are somehow in the case “The Open Box”, IMHO:
>
>     But what if a company knows exactly what kind of model it wants,
>     and simply lacks the computational resources to train it? Such a
>     company would specify what network architecture and training
>     procedure to use, and it would examine the trained model closely.
>
> And yeah there is nothing new ;-) when one says that the result could be
> biased by the person that produced the data. Yeah, we have to trust the
> trainer as we are trusting the people who generated “biased” (*) genomic
> references.
>
> Well, it is very interesting – and scary – to see how to theoretically
> exploit “misclassify adversarial examples“ as described e.g. by [2].
>
> This raises questions about “Verifiable Delegation of Learning”.
>
> From my point of view, the tackle of such biased weights is not via
> re-learning because how to draw the line between biased weights,
> mistakes on their side, mistakes on our side, etc. and it requires a
> high level of expertise to complete a full re-learning. Instead, it
> should come from the ML community that should standardize formal methods
> for verifying that the training had not been biased, IMHO.
> 2: https://arxiv.org/abs/1412.6572
>
> (*) biased genomic references, for one example among many others:
>
>     Relatedly, reports have persisted of major artifacts that arise
>     when identifying variants relative to GRCh38, such as an apparent
>     imbalance between insertions and deletions (indels) arising from
>     systematic mis-assemblies in GRCh38 [15–17]. Overall, these errors
>     and omissions in GRCh38 introduce biases in genomic analyses,
>     particularly in centromeres, satellites, and other complex regions.
>
>     https://doi.org/10.1101/2021.07.12.452063
>
> Cheers,
> simon

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-08 10:21 ` Nathan Dehnel
@ 2023-04-11  8:37   ` Simon Tournier
  2023-04-11 12:41     ` Nathan Dehnel
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-04-11 8:37 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: rprior, guix-devel

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guidelines for pre-trained ML model weight binaries”.  My opinion is that such a guideline should only consider the license of such data.  Other considerations appear to me too hard to be conclusive.

What I am trying to express is that:

 1) Bit-identical rebuilds are worthwhile, for sure!, and they address a
    class of attacks (e.g., “trusting trust”, described in 1984 [1]).
    Aside: I find this message by John Gilmore [2] very instructive
    about the history of bit-identical rebuilds.  (Bit-identical
    rebuilds had been considered by GNU in the early 90’s.)

 2) Bit-identical rebuilds are *not* the solution to everything.
    Obviously.  Many attacks are bit-identical.  Consider the package
    ’python-pillow’: it builds bit-identically, but before c16add7fd9 it
    was subject to CVE-2022-45199.  Only the human expertise that
    produced the patch [3] protects against the attack.

Considering this, I am claiming that:

 a) Bit-identical re-training of ML models is similar to #2; in other
    words, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

    Note that if the re-training is not bit-identical, what would we
    conclude about trust?  It falls under the same cases as the
    non-bit-identical rebuilds of packages such as Julia, or even Guile
    itself.

 b) The resources (human, financial, hardware, etc.) for re-training
    are, in most cases, not affordable.
    Not because it would be difficult or because the task is complex
    (that is covered by point a); no, it is because the requirements in
    terms of resources are just too high.

    Consider that, for some cases where we do not have the resources, we
    already do not bootstrap from source.  See the GHC compiler (*) or
    genomic references.  And I am not saying it is impossible or that we
    should not try; instead, I am saying we have to be pragmatic in some
    cases.

Therefore, my opinion is that pre-trained ML model weight binaries should be included like any other data, and the lack of bootstrappability is not an issue for inclusion in these particular cases.

The question for the inclusion of these pre-trained ML model binary weights is the license.

Last, from my point of view, a tangential question is the size of such pre-trained ML model binary weights.  I do not know if they fit the store.

Well, that’s my opinion on these “Guidelines for pre-trained ML model weight binaries”. :-)

(*) And Ricardo is training hard!  See [4]; part 2 is not yet published, IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon

^ permalink raw reply	[flat|nested] 21+ messages in thread
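The point that a non-bit-identical re-training tells us little is easy to demonstrate: stochastic training diverges unless every source of randomness is pinned. A toy stand-in for SGD in pure Python (the seeded walk and the digest comparison are illustrative assumptions, not real training):

```python
import hashlib
import random

def train(seed, steps=100):
    """Toy 'training': a seeded random walk standing in for SGD.
    With real models, GPU scheduling and data ordering add further
    nondeterminism on top of the seed."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w += 0.01 * rng.uniform(-1.0, 1.0)
    return w

def digest(w):
    """Hash of the final 'weights', as one would compare substitutes."""
    return hashlib.sha256(repr(w).encode()).hexdigest()

same_a = digest(train(seed=1))
same_b = digest(train(seed=1))   # identical seed: bit-identical result
other = digest(train(seed=2))    # different seed: different weights
```

So even an honest re-training will usually not hash-match the distributed weights, which is why the argument above falls back to human expertise rather than bit-identical reproduction.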
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11  8:37 ` Simon Tournier
@ 2023-04-11 12:41   ` Nathan Dehnel
  2023-04-12  9:32     ` Csepp
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-11 12:41 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

> a) Bit-identical re-train of ML models is similar to #2; other said
> that bit-identical re-training of ML model weights does not protect
> much against biased training. The only protection against biased
> training is by human expertise.

Yeah, I didn't mean to give the impression that I thought bit-reproducibility was the silver bullet for AI backdoors with that analogy.  I guess my argument is this: if they release the training info, either 1) it does not produce the bias/backdoor of the trained model, so there's no problem, or 2) it does, in which case an expert will be able to look at it and go "wait, that's not right", raise an alarm, and it will go public.  The expert does not need to be affiliated with Guix, but Guix will eventually hear about it.  Similar to how a normal security vulnerability works.

> b) The resources (human, financial, hardware, etc.) for re-training is,
> for most of the cases, not affordable. Not because it would be
> difficult or because the task is complex, this is covered by the
> point a), no it is because the requirements in term of resources is
> just to high.

Maybe distributed substitutes could change that equation?

On Tue, Apr 11, 2023 at 3:37 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi Nathan,
>
> Maybe there is a misunderstanding. :-)
>
> The subject is “Guideline for pre-trained ML model weight binaries”. My
> opinion on such guideline would to only consider the license of such
> data. Other considerations appear to me hard to be conclusive.
>
> What I am trying to express is that:
>
> 1) Bit-identical rebuild is worth, for sure!, and it addresses a class
>    of attacks (e.g., Trusting trust described in 1984 [1]). Aside, I
>    find this message by John Gilmore [2] very instructive about the
>    history of bit-identical rebuilds. (Bit-identical rebuild had been
>    considered by GNU in the early 90’s.)
>
> 2) Bit-identical rebuild is *not* the solution to all. Obviously.
>    Many attacks are bit-identical. Consider the package
>    ’python-pillow’, it builds bit-identically. But before c16add7fd9,
>    it was subject to CVE-2022-45199. Only an human expertise to
>    produce the patch [3] protects against the attack.
>
> Considering this, I am claiming that:
>
> a) Bit-identical re-train of ML models is similar to #2; other said
>    that bit-identical re-training of ML model weights does not protect
>    much against biased training. The only protection against biased
>    training is by human expertise.
>
>    Note that if the re-train is not bit-identical, what would be the
>    conclusion about the trust? It falls under the cases of non
>    bit-identical rebuild of packages as Julia or even Guile itself.
>
> b) The resources (human, financial, hardware, etc.) for re-training is,
>    for most of the cases, not affordable. Not because it would be
>    difficult or because the task is complex, this is covered by the
>    point a), no it is because the requirements in term of resources is
>    just to high.
>
>    Consider that, for some cases where we do not have the resources, we
>    already do not bootstrap. See GHC compiler (*) or Genomic
>    references. And I am not saying it is impossible or we should not
>    try, instead, I am saying we have to be pragmatic for some cases.
>
> Therefore, my opinion is that pre-trained ML model weight binaries
> should be included as any other data and the lack of bootstrapping is
> not an issue for inclusion in this particular cases.
>
> The question for inclusion about this pre-trained ML model binary
> weights is the license.
>
> Last, from my point of view, the tangential question is the size of such
> pre-trained ML model binary weights. I do not know if they fit the
> store.
>
> Well, that’s my opinion on this “Guidelines for pre-trained ML model
> weight binaries”. :-)
>
> (*) And Ricardo is training hard! See [4]; part 2 is not yet published,
>     IIRC.
>
> 1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
> 2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
> 3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
> 4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html
>
> Cheers,
> simon

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11 12:41 ` Nathan Dehnel
@ 2023-04-12  9:32   ` Csepp
  0 siblings, 0 replies; 21+ messages in thread
From: Csepp @ 2023-04-12 9:32 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: Simon Tournier, rprior, guix-devel

Nathan Dehnel <ncdehnel@gmail.com> writes:

> a) Bit-identical re-train of ML models is similar to #2; other said
> that bit-identical re-training of ML model weights does not protect
> much against biased training. The only protection against biased
> training is by human expertise.
>
> Yeah, I didn't mean to give the impression that I thought
> bit-reproducibility was the silver bullet for AI backdoors with that
> analogy.  I guess my argument is this: if they release the training
> info, either 1) it does not produce the bias/backdoor of the trained
> model, so there's no problem, or 2) it does, in which case an expert
> will be able to look at it and go "wait, that's not right", and will
> raise an alarm, and it will go public.  The expert does not need to be
> affiliated with guix, but guix will eventually hear about it.  Similar
> to how a normal security vulnerability works.
>
> b) The resources (human, financial, hardware, etc.) for re-training is,
> for most of the cases, not affordable. Not because it would be
> difficult or because the task is complex, this is covered by the
> point a), no it is because the requirements in term of resources is
> just to high.
>
> Maybe distributed substitutes could change that equation?

Probably not, it would require distributed *builds*.  Right now Guix can't even use distcc, so it definitely can't use remote GPUs.

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2023-07-04 20:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
2023-04-03 21:18   ` Jack Hill
2023-04-06  8:42 ` Simon Tournier
2023-04-06 13:41   ` Kyle
2023-04-06 14:53     ` Simon Tournier
2023-05-13  4:13       ` 宋文武
2023-05-15 11:18         ` Simon Tournier
2023-05-26 15:37           ` Ludovic Courtès
2023-05-29  3:57             ` zamfofex
2023-05-30 13:15               ` Simon Tournier
2023-07-02 19:51                 ` Ludovic Courtès
2023-07-03  9:39                   ` Simon Tournier
2023-07-04 13:05                     ` zamfofex
2023-07-04 20:03                       ` Vagrant Cascadian
  -- strict thread matches above, loose matches on Subject: below --
2023-04-07  5:50 Nathan Dehnel
2023-04-07  9:42 ` Simon Tournier
2023-04-08 10:21   ` Nathan Dehnel
2023-04-11  8:37     ` Simon Tournier
2023-04-11 12:41       ` Nathan Dehnel
2023-04-12  9:32         ` Csepp
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes; see mirroring instructions on how to clone and mirror all data and code used by this external index.