* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
@ 2023-04-03 18:07 Ryan Prior
  2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
  2023-04-06  8:42 ` Simon Tournier
  0 siblings, 2 replies; 21+ messages in thread
From: Ryan Prior @ 2023-04-03 18:07 UTC (permalink / raw)
  To: Nicolas Graves, licensing@fsf.org; +Cc: guix-devel


Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning I read through the FSDG to see if it gives any guidance on when machine learning model weights are appropriate for inclusion in a free system. It does not seem to offer much.

Many ML models are advertised as "open source", including the llama model that Nicolas (quoted below) is interested in including in Guix. However, according to what I can find in Meta's announcement (https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) and the project's documentation (https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md), the model itself is not covered by the GPLv3 but rather by "a noncommercial license focused on research use cases." I cannot find the full text of this license anywhere after 20 minutes of searching; perhaps others have better ideas about how to find it, or perhaps the Meta team would provide a copy if we ask.

Free systems will face incentives to include trained models in their distributions to support use cases like automatic live transcription of audio, recognition of objects in photos and video, and natural-language-driven help and documentation features. I hope we can update the FSDG to help ensure that any such inclusion fully meets the requirements of freedom for all our users.

Cheers,
Ryan


------- Original Message -------
On Monday, April 3rd, 2023 at 4:48 PM, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:


> 
> 
> 
> Hi Guix!
> 
> I've recently contributed a few tools that make some OSS machine
> learning programs usable in Guix, namely nerd-dictation for dictation
> and llama-cpp as a conversational bot.
> 
> In the first case, I would also like to contribute parameters of some
> localized models so that they can be used more easily through Guix. I've
> already discussed this subject when submitting these patches, without a
> clear answer.
> 
> In the case of nerd-dictation, the model parameters that can be used
> are listed here: https://alphacephei.com/vosk/models
> 
> One caveat is that using all these models can take a lot of space on the
> servers, a burden which is not useful because no real build step is
> needed (except an unzip step). In this case, we can use the
> #:substitutable? #f flag. You can find an example of some of these
> packages right here:
> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm
> 
> So my question is: Should we add this type of model as packages in
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use the tools)?
> 
> 
> --
> Best regards,
> Nicolas Graves


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
@ 2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
  2023-04-03 21:18   ` Jack Hill
  2023-04-06  8:42 ` Simon Tournier
  1 sibling, 1 reply; 21+ messages in thread
From: Nicolas Graves via Development of GNU Guix and the GNU System distribution. @ 2023-04-03 20:48 UTC (permalink / raw)
  To: Ryan Prior, licensing@fsf.org; +Cc: guix-devel

On 2023-04-03 18:07, Ryan Prior wrote:

> Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning I read through the FSDG to see if it gives any guidance on when machine learning model weights are appropriate for inclusion in a free system. It does not seem to offer much.
>
> Many ML models are advertised as "open source", including the llama model that Nicolas (quoted below) is interested in including in Guix. However, according to what I can find in Meta's announcement (https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) and the project's documentation (https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md), the model itself is not covered by the GPLv3 but rather by "a noncommercial license focused on research use cases." I cannot find the full text of this license anywhere after 20 minutes of searching; perhaps others have better ideas about how to find it, or perhaps the Meta team would provide a copy if we ask.

Just to be precise about llama: what I proposed was to include the port
of Facebook's code to C++ (llama.cpp, see ticket 62443 on guix-patches),
which itself has a license.

The weights themselves indeed do not have a license. You can only
download them through torrents because they were leaked. For this model
in particular, we indeed can't include them in Guix (also because of
their sheer size).

The other case I mentioned, and one that is more mature, is the case of VOSK
audio recognition, whose model binaries have an Apache license (you can
find them here: https://alphacephei.com/vosk/models).
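
To make this more concrete, here is a rough sketch of what one of these
model packages could look like.  It is only an illustration: the package
name, URL, version, hash, and installation directory below are
placeholders rather than a tested recipe, and the module imports are
omitted.  The #:substitutable? #f flag mentioned earlier avoids carrying
the large archive on the substitute servers:

--8<---------------cut here---------------start------------->8---
;; Hypothetical sketch of a VOSK model package; URL, hash and layout
;; are placeholders, not a tested recipe.
(define-public vosk-model-small-en-us
  (package
    (name "vosk-model-small-en-us")
    (version "0.15")
    (source (origin
              (method url-fetch)
              (uri (string-append "https://alphacephei.com/vosk/models/"
                                  "vosk-model-small-en-us-" version ".zip"))
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     (list
      ;; There is no real build step, only an unzip, so substitutes
      ;; bring no benefit; do not offer them.
      #:substitutable? #f
      #:install-plan #~'(("." "share/vosk/model"))))
    (native-inputs (list unzip))
    (home-page "https://alphacephei.com/vosk/models")
    (synopsis "Pre-trained VOSK speech-recognition model (small, en-US)")
    (description "Pre-trained weights for the VOSK speech recognizer.")
    (license license:asl2.0)))
--8<---------------cut here---------------end--------------->8---

With such a package, the weights would end up under share/vosk/model in
the user's profile, where the dictation tool could be pointed at them.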

>
> Free systems will face incentives to include trained models in their distributions to support use cases like automatic live transcription of audio, recognition of objects in photos and video, and natural-language-driven help and documentation features. I hope we can update the FSDG to help ensure that any such inclusion fully meets the requirements of freedom for all our users.

Thanks for this email and the question about these guidelines, Ryan. I
would be glad to help if I can.
>
> Cheers,
> Ryan
>
>
> ------- Original Message -------
> On Monday, April 3rd, 2023 at 4:48 PM, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:
>
>
>>
>>
>>
>> Hi Guix!
>>
>> I've recently contributed a few tools that make some OSS machine
>> learning programs usable in Guix, namely nerd-dictation for dictation
>> and llama-cpp as a conversational bot.
>>
>> In the first case, I would also like to contribute parameters of some
>> localized models so that they can be used more easily through Guix. I've
>> already discussed this subject when submitting these patches, without a
>> clear answer.
>>
>> In the case of nerd-dictation, the model parameters that can be used
>> are listed here: https://alphacephei.com/vosk/models
>>
>> One caveat is that using all these models can take a lot of space on the
>> servers, a burden which is not useful because no real build step is
>> needed (except an unzip step). In this case, we can use the
>> #:substitutable? #f flag. You can find an example of some of these
>> packages right here:
>> https://git.sr.ht/~ngraves/dotfiles/tree/main/item/packages.scm
>>
>> So my question is: Should we add this type of model as packages in
>> Guix? If yes, where should we put them? In machine-learning.scm? In a
>> new file machine-learning-models.scm (such a file would never need new
>> modules, and it might avoid some confusion between the tools and the
>> parameters needed to use the tools)?
>>
>>
>> --
>> Best regards,
>> Nicolas Graves

--
Best regards,
Nicolas Graves


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
@ 2023-04-03 21:18   ` Jack Hill
  0 siblings, 0 replies; 21+ messages in thread
From: Jack Hill @ 2023-04-03 21:18 UTC (permalink / raw)
  To: Nicolas Graves; +Cc: Ryan Prior, licensing@fsf.org, guix-devel

On Mon, 3 Apr 2023, Nicolas Graves via "Development of GNU Guix and the GNU System distribution." wrote:

> Just to be precise about llama: what I proposed was to include the port
> of Facebook's code to C++ (llama.cpp, see ticket 62443 on guix-patches),
> which itself has a license.
>
> The weights themselves indeed do not have a license. You can only
> download them through torrents because they were leaked. For this model
> in particular, we indeed can't include them in Guix (also because of
> their sheer size).
>
> The other case I mentioned, and one that is more mature, is the case of VOSK
> audio recognition, whose model binaries have an Apache license (you can
> find them here: https://alphacephei.com/vosk/models).

One more case to consider: rnnoise, which is already packaged in Guix. If
the license file is to be believed, the weights are BSD-3. For how they are
created, see the `TRAINING-README` file in the source. That procedure
produced the `rnn_data.{c,h}` files which are present in the output of `guix
build -S rnnoise`.

Best,
Jack


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
  2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
@ 2023-04-06  8:42 ` Simon Tournier
  2023-04-06 13:41   ` Kyle
  2023-05-13  4:13   ` 宋文武
  1 sibling, 2 replies; 21+ messages in thread
From: Simon Tournier @ 2023-04-06  8:42 UTC (permalink / raw)
  To: Ryan Prior, Nicolas Graves, licensing@fsf.org; +Cc: guix-devel

Hi,

On Mon, 03 Apr 2023 at 18:07, Ryan Prior <rprior@protonmail.com> wrote:

> Hi there FSF Licensing! (CC: Guix devel, Nicolas Graves) This morning
> I read through the FSDG to see if it gives any guidance on when
> machine learning model weights are appropriate for inclusion in a free
> system. It does not seem to offer much. 

Years ago, I asked the FSF and Stallman how to deal with that and I
never got an answer back.  Anyway! :-)

Debian folks discussed this topic [1,2] but I do not know if they have
an “official” policy.

I remember we discussed a similar topic on guix-devel or guix-patches some
years ago – but I cannot find the thread again.

For what my opinion is worth, I think that machine learning model
weights should be considered as any other data (images, text files,
translated strings, etc.) and thus they are appropriate for inclusion
or not depending on whether their license is compliant.

Since it is computing, we could ask about the bootstrap of such
generated data.  I think it is a slippery slope because re-training is
simply not affordable in many cases: (1) we would not have the
hardware resources, from a practical point of view, and (2) it is almost
impossible to tackle the sources of indeterminism (the optimization is
too entangled with randomness).  From my point of view, pre-trained
weights should be considered as the output of a (numerical) experiment,
similarly to how we include other experimental data (from genome to
astronomy datasets).

1: https://salsa.debian.org/deeplearning-team/ml-policy
2: https://people.debian.org/~lumin/debian-dl.html


Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06  8:42 ` Simon Tournier
@ 2023-04-06 13:41   ` Kyle
  2023-04-06 14:53     ` Simon Tournier
  2023-05-13  4:13   ` 宋文武
  1 sibling, 1 reply; 21+ messages in thread
From: Kyle @ 2023-04-06 13:41 UTC (permalink / raw)
  To: guix-devel, Simon Tournier, Ryan Prior, Nicolas Graves,
	licensing@fsf.org





>Since it is computing, we could ask about the bootstrap of such
>generated data.  I think it is a slippery slope because re-training is
>simply not affordable in many cases: (1) we would not have the
>hardware resources, from a practical point of view, and (2) it is almost
>impossible to tackle the sources of indeterminism (the optimization is
>too entangled with randomness). 

I have only seen situations where the optimization is "too entangled with randomness" when models are trained on proprietary GPUs with specific settings. Otherwise, pseudo-random seeds are perfectly sufficient to remove the indeterminism.

=> https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092

Many people think that "ultimate" reproducibility is not practical either. It's always going to be easier in the short term to take shortcuts which make conclusions dependent on secret sauce which few can understand.

=> https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/

>From my point of view, pre-trained
>weights should be considered as the output of a (numerical) experiment,
>similarly to how we include other experimental data (from genome to
>astronomy datasets).

I think it's a stretch to consider data compression an experiment. In experiments I am always finding mistakes that confuse the interpretation and that are hidden by prematurely compressing data, e.g. by taking inappropriate averages. Don't confuse the actual experimental results with dubious data processing steps.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06 13:41   ` Kyle
@ 2023-04-06 14:53     ` Simon Tournier
  0 siblings, 0 replies; 21+ messages in thread
From: Simon Tournier @ 2023-04-06 14:53 UTC (permalink / raw)
  To: Kyle; +Cc: guix-devel, Ryan Prior, Nicolas Graves

Hi,

On Thu, 6 Apr 2023 at 15:41, Kyle <kyle@posteo.net> wrote:

> I have only seen situations where the optimization is "too entangled with randomness" when models are trained on proprietary GPUs with specific settings. Otherwise, pseudo-random seeds are perfectly sufficient to remove the indeterminism.

Feel free to pick a real-world model using 15 billion parameters and
then to train it again.  And if you succeed, feel free to train it
again to check bit-to-bit reproducibility.  Bah, the cost (CPU or GPU
power, and in the end the electricity, so real money) would not be
nothing, and I am far from convinced that paying this bill is worth it,
reproducibility speaking.


> => https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092

Ahah!  I have to laugh, given that Julia itself is already not reproducible.

https://issues.guix.gnu.org/22304
https://issues.guix.gnu.org/47354

And upstream does not care much, as you can see:

https://github.com/JuliaLang/julia/issues/25900
https://github.com/JuliaLang/julia/issues/34753

Well, years ago Nicolô made a patch to improve this, but it has not been
merged yet.

For instance, some people are trying to have "reproducible" benchmarks
of machine learning,

https://benchopt.github.io/

and last time I checked, they were having a good time and a lot of fun. ;-)
Well, I would be less confident in claiming that "pseudo-random seeds are
perfectly sufficient to remove the indeterminism". :-)


> Many people think that "ultimate" reproducibility is not practical either. It's always going to be easier in the short term to take shortcuts which make conclusions dependent on secret sauce which few can understand.
>
> => https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/

Depending on the size of the model, training it again is not
practical.  Similarly, the computations behind weather forecasts are
not practically reproducible and no one is ready to put the amount
of money on the table to do so.  Instead, people exchange
datasets of pressure maps.

Bit-to-bit reproducibility is a means of verifying the correspondence
between some claim and what has concretely been done.  But that's not
the only means.

Speaking from a scientific-method point of view, it is false to
think that it is possible to reproduce everything.  Consider the
physics experiments at the LHC; in this case, the confidence in the
result does not come from independent bit-to-bit reproducibility but
from as much transparency as possible at all the stages.

Moreover, what Ludo wrote in this blog post is his own point of
view and, for example, I do not share all of it.  Anyway. :-)

For sure, bit-to-bit reproducibility is not an end for trusting a result
but a means among many others.  It is possible to have bit-to-bit
reproducible results that are wrong and other results impossible to
reproduce bit-to-bit that are correct.

Well, back to Julia: since part of Julia is not bit-to-bit
reproducible, does that mean that the scientific outputs generated using
Julia are not trustworthy?

All that said, if the re-computation of the weights is affordable
because the size of the model is affordable, then yes, for sure, we could
try.  But from my point of view, the re-computation of the weights
should not be blocking for inclusion.  What should be blocking is the
license of this data (the weights).


> >From my point of view, pre-trained
> >weights should be considered as the output of a (numerical) experiment,
> >similarly to how we include other experimental data (from genome to
> >astronomy datasets).
>
> I think it's a stretch to consider data compression an experiment. In experiments I am always finding mistakes that confuse the interpretation and that are hidden by prematurely compressing data, e.g. by taking inappropriate averages. Don't confuse the actual experimental results with dubious data processing steps.

I do not see where I spoke about data compression.  Anyway. :-)

Well, I claim that data processing is an experiment.  There is no
"actual experiment" versus "data processing".  It is a continuum.

Today, any instrument generating data does numerical processing
internally.  Said otherwise, what you consider as your raw inputs is
considered as output by others, so by following the recursion,
the true original raw material is the physical samples and that is what we
should package, i.e., we should send these physical samples by post
mail and then reproduce everything.  Here, I am stretching. ;-)

The genomic references that we already packaged are also the result of
"data processing" that no one is redoing.  I do not see any difference
between the weights of machine learning models and these genomic
references; they are both generated data resulting from an experiment
(in the broad sense).


Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
@ 2023-04-07  5:50 Nathan Dehnel
  2023-04-07  9:42 ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-07  5:50 UTC (permalink / raw)
  To: rprior, guix-devel

I am uncomfortable with including ML models without their training
data available. It is possible to hide backdoors in them.
https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  5:50 Nathan Dehnel
@ 2023-04-07  9:42 ` Simon Tournier
  2023-04-08 10:21   ` Nathan Dehnel
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-04-07  9:42 UTC (permalink / raw)
  To: Nathan Dehnel, rprior, guix-devel

Hi,

On Fri, 07 April 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:

> I am uncomfortable with including ML models without their training
> data available. It is possible to hide backdoors in them.
> https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

Thanks for pointing out this article!  And some non-mathematical parts of the
original article [1] are also worth a look. :-)

First please note that we are somehow in the case “The Open Box”, IMHO:

        But what if a company knows exactly what kind of model it wants,
        and simply lacks the computational resources to train it? Such a
        company would specify what network architecture and training
        procedure to use, and it would examine the trained model
        closely.

And yeah, there is nothing new ;-) in saying that the result could be
biased by the person who produced the data.  Yeah, we have to trust the
trainer just as we trust the people who generated “biased” (*) genomic
references.

Well, it is very interesting – and scary – to see how one could theoretically
exploit “misclassified adversarial examples” as described e.g. in [2].

This raises questions about “Verifiable Delegation of Learning”.

From my point of view, the way to tackle such biased weights is not via
re-training, because it is unclear how to draw the line between biased
weights, mistakes on their side, mistakes on our side, etc., and it
requires a high level of expertise to complete a full re-training.  Instead,
it should come from the ML community, which should standardize formal
methods for verifying that the training has not been biased, IMHO.

2: https://arxiv.org/abs/1412.6572

(*) biased genomic references, for one example among many others:

        Relatedly, reports have persisted of major artifacts that arise
        when identifying variants relative to GRCh38, such as an
        apparent imbalance between insertions and deletions (indels)
        arising from systematic mis-assemblies in GRCh38
        [15–17]. Overall, these errors and omissions in GRCh38 introduce
        biases in genomic analyses, particularly in centromeres,
        satellites, and other complex regions.

        https://doi.org/10.1101/2021.07.12.452063


Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-07  9:42 ` Simon Tournier
@ 2023-04-08 10:21   ` Nathan Dehnel
  2023-04-11  8:37     ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-08 10:21 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

>From my point of view, the way to tackle such biased weights is not via
re-training, because it is unclear how to draw the line between biased
weights, mistakes on their side, mistakes on our side, etc., and it
requires a high level of expertise to complete a full re-training.
This strikes me as similar to being in the 80s, when Stallman was
writing the GPL, years before Nix was invented, and saying "the
solution to backdoors in executables is not access to source code due
to the difficulty of compiling from scratch for the average user and
due to the difficulty of making bit-reproducible binaries." Like, bit
reproducibility WAS possible, it was just difficult, so practically
speaking users had to use distro binaries they couldn't fully trust.
So some of the benefits of the source code being available were rather
theoretical for a while. So this argument strikes me as pre-emptively
compromising one's principles based on the presumption that a new
technology will never come along that allows one to practically
exploit the benefits of said principles.

>Instead, it
should come from the ML community, which should standardize formal
methods for verifying that the training has not been biased, IMHO.
What "formal methods" for that are known? As per the article, the
hiding of the backdoor in the "whitebox" scenario is cryptographically
secure in the specific case, with that same possibility open for the
general case.

On Fri, Apr 7, 2023 at 5:53 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi,
>
> On Fri, 07 April 2023 at 00:50, Nathan Dehnel <ncdehnel@gmail.com> wrote:
>
> > I am uncomfortable with including ML models without their training
> > data available. It is possible to hide backdoors in them.
> > https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
>
> Thanks for pointing out this article!  And some non-mathematical parts of the
> original article [1] are also worth a look. :-)
>
> First please note that we are somehow in the case “The Open Box”, IMHO:
>
>         But what if a company knows exactly what kind of model it wants,
>         and simply lacks the computational resources to train it? Such a
>         company would specify what network architecture and training
>         procedure to use, and it would examine the trained model
>         closely.
>
> And yeah, there is nothing new ;-) in saying that the result could be
> biased by the person who produced the data.  Yeah, we have to trust the
> trainer just as we trust the people who generated “biased” (*) genomic
> references.
>
> Well, it is very interesting – and scary – to see how one could theoretically
> exploit “misclassified adversarial examples” as described e.g. in [2].
>
> This raises questions about “Verifiable Delegation of Learning”.
>
> From my point of view, the way to tackle such biased weights is not via
> re-training, because it is unclear how to draw the line between biased
> weights, mistakes on their side, mistakes on our side, etc., and it
> requires a high level of expertise to complete a full re-training.  Instead,
> it should come from the ML community, which should standardize formal
> methods for verifying that the training has not been biased, IMHO.
>
> 2: https://arxiv.org/abs/1412.6572
>
> (*) biased genomic references, for one example among many others:
>
>         Relatedly, reports have persisted of major artifacts that arise
>         when identifying variants relative to GRCh38, such as an
>         apparent imbalance between insertions and deletions (indels)
>         arising from systematic mis-assemblies in GRCh38
>         [15–17]. Overall, these errors and omissions in GRCh38 introduce
>         biases in genomic analyses, particularly in centromeres,
>         satellites, and other complex regions.
>
>         https://doi.org/10.1101/2021.07.12.452063
>
>
> Cheers,
> simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-08 10:21   ` Nathan Dehnel
@ 2023-04-11  8:37     ` Simon Tournier
  2023-04-11 12:41       ` Nathan Dehnel
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-04-11  8:37 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: rprior, guix-devel

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guidelines for pre-trained ML model weight binaries”.  My
opinion on such a guideline would be to only consider the license of such
data.  Other considerations appear to me hard to make conclusive.


What I am trying to express is that:

 1) Bit-identical rebuilds are worthwhile, for sure!, and they address a
    class of attacks (e.g., Trusting Trust, described in 1984 [1]).  As an
    aside, I find this message by John Gilmore [2] very instructive about
    the history of bit-identical rebuilds. (Bit-identical rebuilds had been
    considered by GNU in the early 90’s.)

 2) Bit-identical rebuilds are *not* the solution to everything.  Obviously.
    Many attacks are bit-identical.  Consider the package
    ’python-pillow’: it builds bit-identically, but before c16add7fd9,
    it was subject to CVE-2022-45199.  Only the human expertise that
    produced the patch [3] protects against the attack.

Considering this, I am claiming that:

 a) Bit-identical re-training of ML models is similar to #2; said
    otherwise, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

    Note that if the re-training is not bit-identical, what would be the
    conclusion about trust?  It falls under the same cases as the
    non-bit-identical rebuilds of packages such as Julia or even Guile itself.

 b) The resources (human, financial, hardware, etc.) for re-training are,
    in most cases, not affordable.  Not because it would be
    difficult or because the task is complex (this is covered by
    point a)); no, it is because the requirements in terms of resources
    are just too high.

    Consider that, in some cases where we do not have the resources, we
    already do not debootstrap.  See the GHC compiler (*) or genomic
    references.  And I am not saying it is impossible or that we should
    not try; instead, I am saying we have to be pragmatic in some cases.


Therefore, my opinion is that pre-trained ML model weight binaries
should be included as any other data and the lack of debootstrapping is
not an issue for inclusion in these particular cases.

The question for inclusion of these pre-trained ML model binary
weights is the license.

Last, from my point of view, the tangential question is the size of such
pre-trained ML model binary weights.  I do not know if they fit the
store.

Well, that’s my opinion on this “Guidelines for pre-trained ML model
weight binaries”. :-)



(*) And Ricardo is training hard! See [4]; part 2 is not yet published,
IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11  8:37     ` Simon Tournier
@ 2023-04-11 12:41       ` Nathan Dehnel
  2023-04-12  9:32         ` Csepp
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Dehnel @ 2023-04-11 12:41 UTC (permalink / raw)
  To: Simon Tournier; +Cc: rprior, guix-devel

 a) Bit-identical re-training of ML models is similar to #2; said
    otherwise, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

Yeah, I didn't mean to give the impression that I thought
bit-reproducibility was the silver bullet for AI backdoors with that
analogy. I guess my argument is this: if they release the training
info, either 1) it does not produce the bias/backdoor of the trained
model, so there's no problem, or 2) it does, in which case an expert
will be able to look at it and go "wait, that's not right", and will
raise an alarm, and it will go public. The expert does not need to be
affiliated with guix, but guix will eventually hear about it. Similar
to how a normal security vulnerability works.

 b) The resources (human, financial, hardware, etc.) for re-training are,
    in most cases, not affordable.  Not because it would be
    difficult or because the task is complex (this is covered by
    point a)); no, it is because the requirements in terms of resources
    are just too high.

Maybe distributed substitutes could change that equation?

On Tue, Apr 11, 2023 at 3:37 AM Simon Tournier <zimon.toutoune@gmail.com> wrote:
>
> Hi Nathan,
>
> Maybe there is a misunderstanding. :-)
>
> The subject is “Guidelines for pre-trained ML model weight binaries”.  My
> opinion on such a guideline would be to only consider the license of such
> data.  Other considerations appear to me hard to make conclusive.
>
>
> What I am trying to express is that:
>
>  1) Bit-identical rebuilds are worthwhile, for sure!, and they address a
>     class of attacks (e.g., Trusting Trust, described in 1984 [1]).  As an
>     aside, I find this message by John Gilmore [2] very instructive about
>     the history of bit-identical rebuilds. (Bit-identical rebuilds had been
>     considered by GNU in the early 90’s.)
>
>  2) Bit-identical rebuilds are *not* the solution to everything.  Obviously.
>     Many attacks are bit-identical.  Consider the package
>     ’python-pillow’: it builds bit-identically, but before c16add7fd9,
>     it was subject to CVE-2022-45199.  Only the human expertise that
>     produced the patch [3] protects against the attack.
>
> Considering this, I am claiming that:
>
>  a) Bit-identical re-training of ML models is similar to #2; said
>     otherwise, bit-identical re-training of ML model weights does not
>     protect much against biased training.  The only protection against
>     biased training is human expertise.
>
>     Note that if the re-training is not bit-identical, what would be the
>     conclusion about trust?  It falls under the same cases as the
>     non-bit-identical rebuilds of packages such as Julia or even Guile itself.
>
>  b) The resources (human, financial, hardware, etc.) for re-training are,
>     in most cases, not affordable.  Not because it would be
>     difficult or because the task is complex (this is covered by
>     point a)); no, it is because the requirements in terms of resources
>     are just too high.
>
>     Consider that, in some cases where we do not have the resources, we
>     already do not debootstrap.  See the GHC compiler (*) or genomic
>     references.  And I am not saying it is impossible or that we should
>     not try; instead, I am saying we have to be pragmatic in some cases.
>
>
> Therefore, my opinion is that pre-trained ML model weight binaries
> should be included as any other data and the lack of debootstrapping is
> not an issue for inclusion in these particular cases.
>
> The question for inclusion of these pre-trained ML model binary
> weights is the license.
>
> Last, from my point of view, the tangential question is the size of such
> pre-trained ML model binary weights.  I do not know if they fit the
> store.
>
> Well, that’s my opinion on this “Guidelines for pre-trained ML model
> weight binaries”. :-)
>
>
>
> (*) And Ricardo is training hard! See [4]; part 2 is not yet published,
> IIRC.
>
> 1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
> 2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
> 3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
> 4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html
>
> Cheers,
> simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-11 12:41       ` Nathan Dehnel
@ 2023-04-12  9:32         ` Csepp
  0 siblings, 0 replies; 21+ messages in thread
From: Csepp @ 2023-04-12  9:32 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: Simon Tournier, rprior, guix-devel


Nathan Dehnel <ncdehnel@gmail.com> writes:

>  a) Bit-identical re-training of ML models is similar to #2; said
>     otherwise, bit-identical re-training of ML model weights does not
>     protect much against biased training.  The only protection against
>     biased training is human expertise.
>
> Yeah, I didn't mean to give the impression that I thought
> bit-reproducibility was the silver bullet for AI backdoors with that
> analogy. I guess my argument is this: if they release the training
> info, either 1) it does not produce the bias/backdoor of the trained
> model, so there's no problem, or 2) it does, in which case an expert
> will be able to look at it and go "wait, that's not right", and will
> raise an alarm, and it will go public. The expert does not need to be
> affiliated with guix, but guix will eventually hear about it. Similar
> to how a normal security vulnerability works.
>
>  b) The resources (human, financial, hardware, etc.) for re-training are,
>     in most cases, not affordable.  Not because it would be
>     difficult or because the task is complex (this is covered by
>     point a)); no, it is because the requirements in terms of resources
>     are just too high.
>
> Maybe distributed substitutes could change that equation?

Probably not; it would require distributed *builds*.  Right now Guix
can't even use distcc, so it definitely can't use remote GPUs.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-04-06  8:42 ` Simon Tournier
  2023-04-06 13:41   ` Kyle
@ 2023-05-13  4:13   ` 宋文武
  2023-05-15 11:18     ` Simon Tournier
  1 sibling, 1 reply; 21+ messages in thread
From: 宋文武 @ 2023-05-13  4:13 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Simon Tournier <zimon.toutoune@gmail.com> writes:

> Since it is computing, we could ask about the bootstrap of such
> generated data.  I think it is a slippery slope because re-training is
> simply not affordable in many cases: (1) we would not have the
> hardware resources, from a practical point of view, and (2) it is almost
> impossible to tackle the sources of indeterminism (the optimization is
> too entangled with randomness).  From my point of view, pre-trained
> weights should be considered as the output of a (numerical) experiment,
> similarly to how we include other experimental data (from genome to
> astronomy datasets).
>
> 1: https://salsa.debian.org/deeplearning-team/ml-policy
> 2: https://people.debian.org/~lumin/debian-dl.html
>

Hello, zamfofex submitted a package 'lc0', “Leela Chess Zero” (a chess
engine) with an ML model; it also turns out that we already have 'stockfish',
a similar one with a pre-trained model packaged.  Did we reach a
conclusion (so lc0 can also be accepted)?  Or should we remove 'stockfish'?

Thanks!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-13  4:13   ` 宋文武
@ 2023-05-15 11:18     ` Simon Tournier
  2023-05-26 15:37       ` Ludovic Courtès
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-05-15 11:18 UTC (permalink / raw)
  To: 宋文武; +Cc: Ryan Prior, Nicolas Graves, guix-devel, zamfofex

Hi,

On Sat, 13 May 2023 at 12:13, 宋文武 <iyzsong@envs.net> wrote:

> Hello, zamfofex submitted a package 'lc0', “Leela Chess Zero” (a chess
> engine) with an ML model; it also turns out that we already have 'stockfish',
> a similar one with a pre-trained model packaged.  Did we reach a
> conclusion (so lc0 can also be accepted)?  Or should we remove 'stockfish'?

Well, I do not know if we have reached a conclusion.  From my point of
view, both can be included *if* their licenses are compatible with Free
Software – including the weights (pre-trained model) as licensed data.

Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-15 11:18     ` Simon Tournier
@ 2023-05-26 15:37       ` Ludovic Courtès
  2023-05-29  3:57         ` zamfofex
  2023-05-30 13:15         ` Simon Tournier
  0 siblings, 2 replies; 21+ messages in thread
From: Ludovic Courtès @ 2023-05-26 15:37 UTC (permalink / raw)
  To: Simon Tournier
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel,
	zamfofex

Hello,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> On Sat, 13 May 2023 at 12:13, 宋文武 <iyzsong@envs.net> wrote:
>
>> Hello, zamfofex submitted a package 'lc0', “Leela Chess Zero” (a chess
>> engine) with an ML model; it also turns out that we already have 'stockfish',
>> a similar one with a pre-trained model packaged.  Did we reach a
>> conclusion (so lc0 can also be accepted)?  Or should we remove 'stockfish'?
>
> Well, I do not know if we have reached a conclusion.  From my point of
> view, both can be included *if* their licenses are compatible with Free
> Software – including the weights (pre-trained model) as licensed data.

We discussed it in 2019:

  https://issues.guix.gnu.org/36071

This LWN article on the debate that then took place in Debian is
insightful:

  https://lwn.net/Articles/760142/

To me, there is no doubt that neural networks are a threat to user
autonomy: hard to train by yourself without very expensive hardware,
next to impossible without proprietary software, plus you need that huge
amount of data available to begin with.

As a project, we don’t have guidelines about this though.  I don’t know
if we can come up with general guidelines or if we should, at least as a
start, look at things on a case-by-case basis.

Ludo’.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-26 15:37       ` Ludovic Courtès
@ 2023-05-29  3:57         ` zamfofex
  2023-05-30 13:15         ` Simon Tournier
  1 sibling, 0 replies; 21+ messages in thread
From: zamfofex @ 2023-05-29  3:57 UTC (permalink / raw)
  To: Ludovic Courtès, Simon Tournier, 宋文武
  Cc: Ryan Prior, Nicolas Graves, guix-devel

> To me, there is no doubt that neural networks are a threat to user
> autonomy: hard to train by yourself without very expensive hardware,
> next to impossible without proprietary software, plus you need that huge
> amount of data available to begin with.
> 
> As a project, we don’t have guidelines about this though.  I don’t know
> if we can come up with general guidelines or if we should, at least as a
> start, look at things on a case-by-case basis.

I feel like it’s important to have a guideline for this, at least if the issue comes up too frequently.

To me, a sensible *base criterion* is whether the user is able to practically produce their own networks (either from scratch, or by using an existing network) using free software alone. I feel like this solves the issue of user autonomy being at risk.

By “practically produce”, I mean within reasonable time (somewhere between a few minutes and a few weeks depending on the scope) and using exclusively hardware they physically own (assuming they own reasonably recent hardware to run Guix, at least).

The effect is that the user shouldn’t be bound to the provided networks, and should be able to train their own for their own purposes if they so choose, even if using the existing networks during that training. (And in the context of Guix, the neural network needs to be packaged for the user to be able to use it that way.)

Regarding Lc0 specifically, that is already possible! The Lc0 project has a training client that can use existing networks and a set of configurations to train your own special‐purpose network. (And although this client supports proprietary software, it is able to run using exclusively free software too.) In fact, there are already community‐provided networks for Lc0[1], which sometimes can play even more accurately than the official ones (or otherwise play differently in various specialised ways).

Of course, this might seem very dissatisfying, in the same way that providing binary seeds for software programs is, in the sense that you require an existing network to further train networks, rather than being able to start a network from scratch (in this case). But I feel like (at least under my “base criterion”), the effects of this on the user are not as significant, since the effects of the networks are limited compared to those of actual programs.

In the sense that, even though you might want to argue that “the network affects the behavior of the program using it” in the same way as “a Python source file affects the behavior of its interpreter”, the effect of the network file on the program is limited compared to that of a Python program. It’s much more like how an image would affect the behavior of the program displaying it. More concretely, there isn’t a trust issue to be solved, because the network doesn’t have as many capabilities (theoretical or practical) as a program does.

I say “practical capabilities” in the sense of being able to access user resources and data for purposes they don’t want. (E.g. By accessing/modifying their files, sharing their data through the Internet without their knowledge, etc.)

I say “theoretical capabilities” in the sense of doing things the user doesn’t want nor expects, i.e. thinking about using computations as a tool for some purpose. (E.g. Even sandboxed/containerised programs can be harmful, because the program could behave in a way the user doesn’t want without letting the user do something about it.)

The only autonomy‐disrespecting (or perhaps rather freedom‐disrespecting) issue is when the user is stuck with the provided network, and doesn’t have any tools to (practically) change how the program behaves by creating a different network that suits their needs. (Which is what my “base criterion” tries to defend against.) This is not the case with Lc0, as I said.

Finally, I will also note that, in addition to the aforementioned[2] fact that Stockfish (already packaged) does use pre‐trained neural networks too, the latest versions of Stockfish (from 14 onward) use neural networks that have themselves been indirectly trained using the networks from the Lc0 project.[3]

[1]: See <https://lczero.org/play/networks/basics/#training-data>
[2]: It was mentioned in <https://issues.guix.gnu.org/63088>
[3]: See <https://stockfishchess.org/blog/2021/stockfish-14/>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-26 15:37       ` Ludovic Courtès
  2023-05-29  3:57         ` zamfofex
@ 2023-05-30 13:15         ` Simon Tournier
  2023-07-02 19:51           ` Ludovic Courtès
  1 sibling, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-05-30 13:15 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel,
	zamfofex

Hi Ludo,

On Fri, 26 May 2023 at 17:37, Ludovic Courtès <ludo@gnu.org> wrote:

>> Well, I do not know if we have reached a conclusion.  From my point of
>> view, both can be included *if* their licenses are compatible with Free
>> Software – including the weights (pre-trained model) as licensed data.
>
> We discussed it in 2019:
>
>   https://issues.guix.gnu.org/36071

Your concern in this thread was:

        My point is about whether these trained neural network data are
        something that we could distribute per the FSDG.

        https://issues.guix.gnu.org/36071#3-lineno21

and we discussed this specific concern for the package leela-zero.
Quoting 3 messages:

        Perhaps we could do the same, but I’d like to hear what others think.

        Back to this patch: I think it’s fine to accept it as long as the
        software necessary for training is included.

        The whole link is worth a click since there seems to be a ‘server
        component’ involved as well.

        https://issues.guix.gnu.org/36071#3-lineno31
        https://issues.guix.gnu.org/36071#5-lineno52
        https://issues.guix.gnu.org/36071#6-lineno18


And somehow I am raising the same concern for packages using weights.  We
could discuss case by case; instead, I find it important to sketch
guidelines about the weights because it would help to decide what to do
with neural networks such as “Leela Chess Zero” [1] or others (see below).

1: https://issues.guix.gnu.org/63088


> This LWN article on the debate that then took place in Debian is
> insightful:
>
>   https://lwn.net/Articles/760142/

As pointed out in #36071 mentioned above, this LWN article is a digest of
some Debian discussion, and it is also worth taking a look at the raw
material (arguments):

https://lists.debian.org/debian-devel/2018/07/msg00153.html


> To me, there is no doubt that neural networks are a threat to user
> autonomy: hard to train by yourself without very expensive hardware,
> next to impossible without proprietary software, plus you need that huge
> amount of data available to begin with.

About the “others” from above, please note that GNU Backgammon, already
packaged in Guix under the name ’gnubg’, raises similar questions. :-)

Quoting the webpage [2]:

        Tournament match and money session cube handling and cubeful
        play. All governed by underlying cubeless money game based
        neural networks.


As Russ Allbery points out [3] – similarly to what I tried to do in this
thread – it seems hard to distinguish data resulting from pre-processing
such as training from data that simply consists of well-fitted
parameters.


2: https://www.gnu.org/software/gnubg/
3: https://lwn.net/Articles/760199/


> As a project, we don’t have guidelines about this though.  I don’t know
> if we can come up with general guidelines or if we should, at least as a
> start, look at things on a case-by-case basis.

Somehow, if we do not have guidelines to help in deciding, it makes
the review of #63088 [1], asking for the inclusion of lc0, harder, and it
makes it hard to know what to do about GNU Backgammon.

On these specific cases, what do we do? :-)


Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-05-30 13:15         ` Simon Tournier
@ 2023-07-02 19:51           ` Ludovic Courtès
  2023-07-03  9:39             ` Simon Tournier
  0 siblings, 1 reply; 21+ messages in thread
From: Ludovic Courtès @ 2023-07-02 19:51 UTC (permalink / raw)
  To: Simon Tournier
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel,
	zamfofex

Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> Somehow, if we do not have guidelines to help in deciding, it makes
> the review of #63088 [1], asking for the inclusion of lc0, harder, and it
> makes it hard to know what to do about GNU Backgammon.
>
> On these specific cases, what do we do? :-)

Someone™ has to invest time in studying this specific case, look at what
others like Debian are doing, and seek consensus on a way forward.

Based on that, perhaps Someone™ can generalize that decision-making
process into more generic guidelines.  It would definitely be
beneficial.

:-)

(Yes, that’s very hand-wavy, but that’s really what needs to happen!)

Thanks,
Ludo’.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-02 19:51           ` Ludovic Courtès
@ 2023-07-03  9:39             ` Simon Tournier
  2023-07-04 13:05               ` zamfofex
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Tournier @ 2023-07-03  9:39 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel,
	zamfofex

Hi,

On Sun, 02 Jul 2023 at 21:51, Ludovic Courtès <ludo@gnu.org> wrote:

> Someone™ has to invest time in studying this specific case, look at what
> others like Debian are doing, and seek consensus on a way forward.

Hum, I am probably not this Someone™ but here is the result of my looking around. :-)


First, please note that the Debian thread [1] is about,

    Concerns to software freedom when packaging deep-learning based appications

and not specifically about Leela Zero.  The thread asked the general
question keeping in mind the packaging of leela-zero [2] – now available
in Debian [3].  Back in 2019, patch#36071 [4] also introduced
leela-zero in Guix.  The issues about machine learning that this package
raises are not gone, even though this specific package has been included,
because we are lacking a guideline. :-) Quoting the Debian package
description [3], it reads,

        Recomputing the AlphaGo Zero weights will take about 1700 years
        on commodity hardware. Upstream is running a public, distributed
        effort to repeat this work. Working together, and especially
        when starting on a smaller scale, it will take less than 1700
        years to get a good network (which you can feed into this
        program, suddenly making it strong). To help with this effort,
        run the leelaz-autogtp binary provided in this package. The
        best-known network weights file is at
        http://zero.sjeng.org/best-network

For instance, this message [5] from patch#36071,

        We need to ensure that the software necessary to train the
        networks is included.  Is this the case?

        Back to this patch: I think it’s fine to accept it as long as the
        software necessary for training is included.

would suggest that the training software should be part of the package
for inclusion, although it would not be affordable to recompute this
training.  Well, applying this “criterion”, GNU Backgammon (the package
gnubg, included for a while now) should be rejected since there is no
training software for the neural network weights.  For the inclusion of
leela-zero in Guix, the argument is from [6] quoting [7]:

            So this is basically a distributed computing client as well as a
            Go engine that runs with the results of that distributed
            computing effort.

        If that's true, there is no separate ‘training software’ to
        worry about.

which draws the line: free client vs. the “database”, as pointed out in [8].

Somehow, we have to distinguish cases depending on the weights.

    If the weights are clearly separated, as with leela-zero, then the
    code (neural network) itself can be included.  Else, if the weights
    are tangled with the code, then we distribute them only if their
    license is compliant with the FSDG, as for any other data; this is
    the case with GNU Backgammon, IIUC.

Well, I do not see any difference between pre-trained weights and icons
or sounds or well-fitted parameters (e.g., the package
python-scikit-learn has a lot ;-)).  As I said elsewhere, I do not see
the difference between pre-trained neural network weights and genomic
references (e.g., the package r-bsgenome-hsapiens-1000genomes-hs37d5).

The only question for inclusion or not is about the license, IMHO.  For
sure, it is far better if we are able to recompute the weights.
However, just as debootstrapping is a strong recommendation, the
ability to recompute the pre-trained neural network weights should just
be a recommendation.

Please note this message [11] from Nicolas about VOSK models and
patch#62443 [12], already merged; the weights are separated, as with the
package leela-zero.
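
As a rough illustration of what that separation could look like on the
packaging side, a consumer package can take the separately packaged
weights as a regular input and point the program at their store
location.  The sketch below is hypothetical: the package names, the
VOSK_MODEL_PATH variable, and the wrapping phase are assumptions for
illustration, not the actual nerd-dictation or VOSK interface.

--8<---------------cut here---------------start------------->8---
;; Hypothetical sketch: wiring a separately packaged model into an
;; application package.  Names and the environment variable are
;; illustrative assumptions, not a tested recipe.
(define-public nerd-dictation-with-model
  (package
    (inherit nerd-dictation)               ;hypothetical base package
    (name "nerd-dictation-with-model")
    (inputs
     (modify-inputs (package-inputs nerd-dictation)
       (append vosk-model-small-en-us)))   ;separately packaged weights
    (arguments
     (substitute-keyword-arguments (package-arguments nerd-dictation)
       ((#:phases phases #~%standard-phases)
        #~(modify-phases #$phases
            (add-after 'install 'point-at-model
              (lambda* (#:key inputs outputs #:allow-other-keys)
                ;; Record the store location of the weights in a wrapper
                ;; so the program finds them without user intervention.
                (wrap-program (string-append (assoc-ref outputs "out")
                                             "/bin/nerd-dictation")
                  `("VOSK_MODEL_PATH" =
                    (,(search-input-directory
                       inputs "share/vosk/model"))))))))))))
--8<---------------cut here---------------end--------------->8---

Whether something like this belongs in Guix proper is exactly the
guideline question; the sketch only shows that the program and the
weights can live in separate packages.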

All that said.  Second, please note that the Debian thread dates from 2018, as
does the LWN article [13]; and I am not aware of anything new since
then.  Third, I have never read anything on this topic produced by GNU
or related projects; and the fact that GNU Backgammon [14] distributes the
weights without a way to reproduce them draws one line.  Fourth, we – at least
I – are still waiting for an answer from licensing@fsf.org; on the FSF side, I
am only aware of this [15] and also these white papers [16] about the
very specific case of Copilot.

On the Debian side, I am only aware of [17,18]:

    Unofficial Policy for Debian & Machine Learning


1: https://lists.debian.org/debian-devel/2018/07/msg00153.html
2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903634
3: https://packages.debian.org/trixie/leela-zero
4: https://issues.guix.gnu.org/36071
5: https://issues.guix.gnu.org/36071#5
6: https://issues.guix.gnu.org/36071#6
7: https://lwn.net/Articles/760483/
8: https://issues.guix.gnu.org/36071#4-lineno34
9: https://yhetil.org/guix/87v8gtzvu3.fsf@gmail.com
11: https://yhetil.org/guix/87jzyshpyr.fsf@ngraves.fr
12: https://issues.guix.gnu.org/62443
13: https://lwn.net/Articles/760142/
14: https://www.gnu.org/software/gnubg/
15: https://www.fsf.org/bulletin/2022/spring/unjust-algorithms/
16: https://www.fsf.org/news/publication-of-the-fsf-funded-white-papers-on-questions-around-copilot
17: https://salsa.debian.org/deeplearning-team/ml-policy
18: https://people.debian.org/~lumin/debian-dl.html

Cheers,
simon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-03  9:39             ` Simon Tournier
@ 2023-07-04 13:05               ` zamfofex
  2023-07-04 20:03                 ` Vagrant Cascadian
  0 siblings, 1 reply; 21+ messages in thread
From: zamfofex @ 2023-07-04 13:05 UTC (permalink / raw)
  To: Simon Tournier, Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel

> On 07/03/2023 6:39 AM -03 Simon Tournier <zimon.toutoune@gmail.com> wrote:
> 
> Well, I do not see any difference between pre-trained weights and
> icons, sounds, or well-fitted parameters (e.g., the package
> python-scikit-learn ships a lot of them ;-)).  As I said elsewhere, I
> do not see any difference between pre-trained neural network weights
> and genomic references (e.g., the package
> r-bsgenome-hsapiens-1000genomes-hs37d5).

I feel like, although this might (arguably) not be the case for leela-zero or Lc0 specifically, for certain machine learning projects a pretrained network can affect the program’s behavior so deeply that it might be considered a program itself! Such networks usually approximate an arbitrary function; the more complex the model, the more complex the behavior of that function can be, and thus the closer it is to being an arbitrary program.

But this “program” has no source code; it is effectively created directly in a binary form that is difficult to analyse.
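
To make that concrete, here is a toy sketch of mine (not taken from any
real project): a tiny two-layer network in Guile Scheme. The code below
is completely generic; whether the network computes XOR, AND, or
something else is decided entirely by the numbers in the weight tables.

    (define (step x) (if (> x 0) 1 0))

    ;; A single neuron: weighted sum of the inputs plus a bias, then a step.
    (define (neuron weights bias inputs)
      (step (+ bias (apply + (map * weights inputs)))))

    ;; All of the “behaviour” lives in these numbers, not in the code above.
    (define hidden-units '(((1 1) . -0.5)      ;an OR-like unit
                           ((-1 -1) . 1.5)))   ;a NAND-like unit
    (define output-unit '((1 1) . -1.5))       ;an AND of the two hidden units

    (define (network a b)
      (let ((hidden (map (lambda (u) (neuron (car u) (cdr u) (list a b)))
                         hidden-units)))
        (neuron (car output-unit) (cdr output-unit) hidden)))

    (map (lambda (p) (apply network p)) '((0 0) (0 1) (1 0) (1 1)))
    ;; => (0 1 1 0), i.e. XOR, but nothing in the code says so.

Scale the same idea up to millions of weights and you get something
that behaves like a program, yet offers nothing readable to study or
modify.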

In any case, I feel like the “user autonomy” issue Ludovic was talking about is fairly relevant (as I understand it). For icons, images, and other similar kinds of assets, it is easy enough for the user to replace them, or to create their own if they want. But for pretrained networks, even if they are under a free license, the user might not be able to easily create their own network that suits their purposes.

For example, for image-recognition software, the maintainers of the program might provide data able to recognise a specific set of objects in input images, but the user might want to use it to recognise a different kind of object. If it is too costly for the user to train a new network for their purposes (in terms of hardware and time required), they are effectively bound by the decisions of the maintainers of the software and cannot change it to suit their purposes.

In that sense, there *might* be room for the maintainers to intentionally and maliciously bind the user to the kinds of data they want to provide. And perhaps even more likely (and even more dangerously), when the data is opaque enough, there is room for the maintainers to bias the networks in obscure ways without telling the user. You can imagine this being used in the context of, say, text generation or translation, for the developers to embed a certain opinion they have into the network in order to bias people towards it.

But even when not done maliciously, this can still be limiting to the user if they are unable to easily train their own networks as a replacement.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
  2023-07-04 13:05               ` zamfofex
@ 2023-07-04 20:03                 ` Vagrant Cascadian
  0 siblings, 0 replies; 21+ messages in thread
From: Vagrant Cascadian @ 2023-07-04 20:03 UTC (permalink / raw)
  To: zamfofex, Simon Tournier, Ludovic Courtès
  Cc: 宋文武, Ryan Prior, Nicolas Graves, guix-devel

[-- Attachment #1: Type: text/plain, Size: 2784 bytes --]

On 2023-07-04, zamfofex wrote:
>> On 07/03/2023 6:39 AM -03 Simon Tournier <zimon.toutoune@gmail.com> wrote:
>> 
>> Well, I do not see any difference between pre-trained weights and
>> icons, sounds, or well-fitted parameters (e.g., the package
>> python-scikit-learn ships a lot of them ;-)).  As I said elsewhere, I
>> do not see any difference between pre-trained neural network weights
>> and genomic references (e.g., the package
>> r-bsgenome-hsapiens-1000genomes-hs37d5).
>
> I feel like, although this might (arguably) not be the case for
> leela-zero or Lc0 specifically, for certain machine learning projects
> a pretrained network can affect the program’s behavior so deeply that
> it might be considered a program itself! Such networks usually
> approximate an arbitrary function; the more complex the model, the
> more complex the behavior of that function can be, and thus the
> closer it is to being an arbitrary program.
>
> But this “program” has no source code; it is effectively created
> directly in a binary form that is difficult to analyse.
>
> In any case, I feel like the “user autonomy” issue Ludovic was
> talking about is fairly relevant (as I understand it). For icons,
> images, and other similar kinds of assets, it is easy enough for the
> user to replace them, or to create their own if they want. But for
> pretrained networks, even if they are under a free license, the user
> might not be able to easily create their own network that suits
> their purposes.
>
> For example, for image-recognition software, the maintainers of the
> program might provide data able to recognise a specific set of
> objects in input images, but the user might want to use it to
> recognise a different kind of object. If it is too costly for the
> user to train a new network for their purposes (in terms of hardware
> and time required), they are effectively bound by the decisions of
> the maintainers of the software and cannot change it to suit their
> purposes.

For a more concrete example, with facial recognition in particular, many
models are quite good at recognising the faces of people of
predominantly white European descent, and not very good with people of
other backgrounds, in particular people with darker skin. The models
frequently reflect the blatant and subtle biases of the society in which
they are created and of the creators who develop them. This can have
disastrous consequences when using these models without that
understanding... (or even if you do understand the general biases!)

This seems like a significant issue for user freedom; with source code,
you can at least in theory examine the biases of the software you are
using.


live well,
  vagrant


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-07-04 20:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-03 18:07 Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?) Ryan Prior
2023-04-03 20:48 ` Nicolas Graves via Development of GNU Guix and the GNU System distribution.
2023-04-03 21:18   ` Jack Hill
2023-04-06  8:42 ` Simon Tournier
2023-04-06 13:41   ` Kyle
2023-04-06 14:53     ` Simon Tournier
2023-05-13  4:13   ` 宋文武
2023-05-15 11:18     ` Simon Tournier
2023-05-26 15:37       ` Ludovic Courtès
2023-05-29  3:57         ` zamfofex
2023-05-30 13:15         ` Simon Tournier
2023-07-02 19:51           ` Ludovic Courtès
2023-07-03  9:39             ` Simon Tournier
2023-07-04 13:05               ` zamfofex
2023-07-04 20:03                 ` Vagrant Cascadian
  -- strict thread matches above, loose matches on Subject: below --
2023-04-07  5:50 Nathan Dehnel
2023-04-07  9:42 ` Simon Tournier
2023-04-08 10:21   ` Nathan Dehnel
2023-04-11  8:37     ` Simon Tournier
2023-04-11 12:41       ` Nathan Dehnel
2023-04-12  9:32         ` Csepp
