* Next Steps For the Software Heritage Problem
@ 2024-06-18  8:37 MSavoritias
  2024-06-18 14:19 ` Ian Eure
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-18  8:37 UTC (permalink / raw)
  To: guix-devel

Hello,

Context:

As you may already know, there have been discussions for a while now
about Software Heritage and the LLM they are collaborating on. The model
itself was announced at
https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/

As I have started writing some packages I became interested in how I
might actually stop my code from ever reaching Software Heritage or at
the very least said LLM model. Every single package in guix is added
there automatically.

I sent an email on Friday and got an answer back that no such consent
mechanism has been implemented, and I was pointed to the legal terms.
Instead, what I am supposed to do is:

After guix has my code, my code will be automatically in Software
Heritage and the LLM model. So I am supposed to opt out separately with
both of them to ensure that my code won't be used for future versions.
This of course means that my code will stay forever in Software
Heritage and the LLM model (or some version of it at least).

The reasoning that was given was that code harvesting happens anyway
and we give an opt-out. I am guessing it's opt-out and not opt-in
because they would have less code but this is speculation of course :)

This is against our desire to make it a welcoming space and also
against the spirit of our CoC. Specifically because authors do not know
this happens when they submit packages to Guix. So it is all done
without consent.

Next Steps:

So what can we do as a Guix community from here?
Communication/Writing wise:

1. Add a clear disclaimer/requirement that, for any new package added
to Guix, the submitter has to give consent or get consent from the
person the package was written by. This needs to be added to the docs
and to the email procedures.
2. Make a blog post about our stance towards Software Heritage and the
code harvesting they are doing. The post would explain, on environmental
and ethical grounds, why Guix is against this, and mention Software
Heritage specifically. This is done to distance ourselves and state that
we do not like what is happening in case anyone comes asking, and
hopefully to put public pressure on Software Heritage.
3. Exclude all Software Heritage merch, stands, talks, people in
official capacity, logos, or anything else that participates in social
events of guix and write it in some rules we have. Also write in
channel rules that Software Heritage is off-topic the same way Non-Free
Software is off-topic.
4. There doesn't seem to be any movement on the side of Guix towards:
- Accountability in an official capacity of SH for the terrible
  handling of the trans name incident and a plan to make it easier in
  the future.
- The LLM problem that was mentioned in this email.
So with that said I urge anybody who has been in contact with them in
an official Guix capacity to come forward, otherwise I can volunteer to
be that. Idk if we have a community outreach thing I need to be in also
for that. (we should if not)

The above makes two assumptions:
1. That the Guix community is against LLM/"AI". Which on environmental
and ethical grounds we should be.
2. That we are a consent culture.

Coding-wise, this has been talked about before. Some potential options
are:
- Communicate with Software Heritage to be able to give a "sign" that
the code that is sent should go or not in the code harvesting project.
- Remove all Software Heritage integration since it's too hard to be
  ethical about it and build a better solution.
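
The first option above can be sketched in code. This is only a minimal
illustration, assuming a hypothetical per-package consent flag
(`llm_opt_in`) and a hypothetical request payload. Neither Guix nor
Software Heritage is known to offer such a field, and every name here is
made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Hypothetical package record; not Guix's actual package type."""
    name: str
    source_url: str
    llm_opt_in: bool = False  # assumed consent marker; absent means "no"


def archive_request(pkg: Package) -> dict:
    """Build a made-up archive request that carries the consent "sign"."""
    return {
        "origin": pkg.source_url,
        # The archive could still preserve the code while keeping it
        # out of any training dataset when this is False.
        "allow_llm_training": pkg.llm_opt_in,
    }


packages = [
    Package("hello", "https://example.org/hello.git"),
    Package("frob", "https://example.org/frob.git", llm_opt_in=True),
]
requests = [archive_request(p) for p in packages]
```

Defaulting the flag to False makes the scheme opt-in: a package only
reaches the harvesting project when its author has said so explicitly.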

Conclusion:

To summarize from the steps I wrote above, it seems Software Heritage
makes it harder and harder for us to actually be the inclusive,
welcoming space we want to be. Idk what that leaves us; as I said, I am
not part of any "insider" discussions. But things do not seem to be
moving much, and it's time to start doing actionable things in another
direction.

MSavoritias


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-18  8:37 MSavoritias
@ 2024-06-18 14:19 ` Ian Eure
  2024-06-19  8:36   ` Dale Mellor
  2024-06-18 16:21 ` Greg Hogan
  2024-06-19 10:10 ` Efraim Flashner
  2 siblings, 1 reply; 42+ messages in thread
From: Ian Eure @ 2024-06-18 14:19 UTC (permalink / raw)
  To: guix-devel

Hi MSavoritias,

Thank you for the email.

I’m going to lay out this situation as clearly as I can, in the 
hope that others will better understand, and hopefully treat it 
with the seriousness it deserves.

1. Guix requests SWH to archive some source code.  This is fine.

2. SWH archives the code.  This is also fine.

3. SWH gives all their source to an AI company, HuggingFace.  This 
is questionable.  While fine in theory, the company they gave it 
to, HuggingFace, violates both the licenses of the code they’re 
given, and SWH’s own policy on LLMs.  Instead of terminating the 
partnership, SWH has continued to tout it as "responsible AI" in 
the face of these violations[1].  This makes me doubt whether 
they’re acting in good faith.

4. HuggingFace trains an LLM out of all the code they’re given and 
redistributes it.  This is *not* fine.  The LLM is a derivative 
work of the source code it’s trained on, which violates the 
licenses of many projects in its training set -- it’s akin to 
compiling a gigantic .so file built from the SWH dataset.

5. HuggingFace uses its StarCoder2 LLM to generate source code. 
This is *also* not fine.  This output is also a derivative work of 
the inputs, and it’s redistributed with no license or attribution 
whatsoever.  HuggingFace purports to include attribution in their 
model; however, their own tools make no use of it and emit code 
with no attribution.  You can observe this behavior yourself: 
https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground

I understand Guix’s participation is several degrees removed from 
where the core of the problem lies.  However, the partnership with 
SWH is indirectly enabling massive violations of the licenses of 
the software it packages.  Guix should stop doing that.

Thanks,

  — Ian

[1]: 
https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/

MSavoritias <email@msavoritias.me> writes:

> Hello,
>
> Context:
>
> As you may already know there have discussions around Software 
> Heritage
> and the LLM model they are collaborating with for a bit now. The 
> model
> itself was announced at
> https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/
>
> As I have started writing some packages I became interested in 
> how I
> might actually stop my code from ever reaching Software Heritage 
> or at
> the very least said LLM model. Every single package in guix is 
> added
> there automatically.
>
> I sent an email on Friday and I got an answer back that such 
> consent
> mechanism hasn't been implemented and I was shown the legal 
> terms.
> instead what I am supposed to do is:
>
> After guix has my code, my code will be automatically in 
> Software
> Heritage and the LLM model. So I am supposed to opt out 
> seperately with
> both of them to ensure that my code wont be used for future 
> versions.
> This of course means that my code will stay forever in Software
> Heritage and the LLM model (or some version of it at least).
>
> The reasoning that was given was that code harvesting happens 
> anyway
> and we give an opt-out. I am guessing its opt-out and not opt-in
> because they would have less code but this is speculation of 
> course :)
>
> This is against our desire to make it a welcoming space and also
> against the spirit of our CoC. Specifically because authors do 
> not know
> this happens when they submit packages to Guix. So it is all 
> done
> without consent.
>
> Next Steps:
>
> So what can we do as a Guix community from here?
> Communication/Writing wise:
>
> 1. Add a clear disclaimer/requirment that any new package that 
> is added
> in Guix, the person has to give consent or get consent from the 
> person
> that the package is written in. This needs to be added in the 
> docs and
> in the email procedures.
> 2. Make a blog post of our stance towards Software Heritage and 
> the
> code harvesting they are doing. This post will write in 
> environmental
> and ethical grounds why Guix is against this and mention 
> specifically
> Software Heritage. This is done to separate and mention that we 
> do not
> like what is happening in case anyone comes asking, and 
> hopefully give
> public pressure to Software Heritage.
> 3. Exclude all Software Heritage merch, stands, talks, people in
> official capacity, logos, or anything else that participates in 
> social
> events of guix and write it in some rules we have. also write in
> channel rules that Software Heritage is offtopic same way 
> Non-Free
> Software is offtopic.
> 4. There doesn't seem to be any movement on the side of Guix 
> towards:
> - Accountability in an official capacity of SH for the terrible
>   handling of the trans name incident and a plan to make it 
>   easier in
>   the future.
> - The LLM problem that was mentioned in this email.
> So with that said I urge anybody who has been in contact with 
> them in
> an official Guix capacity to come forward, otherwise I can 
> volunteer to
> be that. Idk if we have a community outreach thing I need to be 
> in also
> for that. (we should if not)
>
> The above make two assumptions:
> 1. That the Guix community is against LLM/"AI". Which for 
> environmental
> and ethical grounds we should be.
> 2. That we are a consent culture.
>
> Coding Wise this has been talked about before some potential 
> options
> are:
> - Communicate with Software Heritage to be able to give a "sign" 
> that
> the code that is sent should go or not in the code harvesting 
> project.
> - Remove all Software Heritage integration since its too hard to 
> be
>   ethical about it and built a better solution.
>
> Conclusion:
>
> To summarize from the steps I wrote above, it seems Software 
> Heritage
> makes it harder and harder for us to actually be an inclusive,
> welcoming space we want to be. Idk what that leaves us, as I 
> said I am
> not part of any "insider" discussions. But it seems to not move 
> that
> much and its time to start doing actionable things in another 
> direction.
>
> MSavoritias



* Re: Next Steps For the Software Heritage Problem
  2024-06-18  8:37 MSavoritias
  2024-06-18 14:19 ` Ian Eure
@ 2024-06-18 16:21 ` Greg Hogan
  2024-06-18 16:33   ` MSavoritias
  2024-06-19 10:10 ` Efraim Flashner
  2 siblings, 1 reply; 42+ messages in thread
From: Greg Hogan @ 2024-06-18 16:21 UTC (permalink / raw)
  To: MSavoritias; +Cc: guix-devel

On Tue, Jun 18, 2024 at 4:37 AM MSavoritias <email@msavoritias.me> wrote:
>
> 1. Add a clear disclaimer/requirment that any new package that is added
> in Guix, the person has to give consent or get consent from the person
> that the package is written in. This needs to be added in the docs and
> in the email procedures.

You will be happy to know that Guix has always had this requirement
[1] by only packaging software licensed with the four essential
freedoms [2]. It's the first item on the Guix homepage.

[1] https://guix.gnu.org/manual/en/html_node/Software-Freedom.html
[2] https://www.gnu.org/philosophy/free-sw.en.html



* Re: Next Steps For the Software Heritage Problem
  2024-06-18 16:21 ` Greg Hogan
@ 2024-06-18 16:33   ` MSavoritias
  2024-06-18 17:31     ` Greg Hogan
  0 siblings, 1 reply; 42+ messages in thread
From: MSavoritias @ 2024-06-18 16:33 UTC (permalink / raw)
  To: Greg Hogan; +Cc: MSavoritias, guix-devel

On Tue, 18 Jun 2024 12:21:33 -0400
Greg Hogan <code@greghogan.com> wrote:

> On Tue, Jun 18, 2024 at 4:37 AM MSavoritias <email@msavoritias.me>
> wrote:
> >
> > 1. Add a clear disclaimer/requirment that any new package that is
> > added in Guix, the person has to give consent or get consent from
> > the person that the package is written in. This needs to be added
> > in the docs and in the email procedures.  
> 
> You will be happy to know that Guix has always had this requirement
> [1] by only packaging software licensed with the four essential
> freedoms [2]. It's the first item on the Guix homepage.
> 
> [1] https://guix.gnu.org/manual/en/html_node/Software-Freedom.html
> [2] https://www.gnu.org/philosophy/free-sw.en.html

Ah it seems I wasn't clear enough.
I meant write something like:

By packaging a software project for Guix you are exposing said software
to a code harvesting project (also known as LLMs or "AI") run by
Software Heritage and/or their partners. Make sure you have gotten
fully informed consent and that the author of this package fully
understands what the implications are.

Something like that. To make it clear that the package that is about to
be added to Guix is going to be harvested for the LLM models Software
Heritage decided to share the code with.

Hope this is more clear.

MSavoritias



* Re: Next Steps For the Software Heritage Problem
@ 2024-06-18 17:12 Andy Tai
  2024-06-18 18:08 ` Ian Eure
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Tai @ 2024-06-18 17:12 UTC (permalink / raw)
  To: guix-devel

What is the role of GNU Guix in this? If Guix is mainly a referral
mechanism, like web page links to the actual contents, isn't the real
problem not Guix but the use of free software to train LLMs, software
which can be obtained directly via other mechanisms anyway even if Guix
is not in the loop?



* Re: Next Steps For the Software Heritage Problem
  2024-06-18 16:33   ` MSavoritias
@ 2024-06-18 17:31     ` Greg Hogan
  2024-06-18 17:57       ` Ian Eure
  2024-06-19  7:01       ` MSavoritias
  0 siblings, 2 replies; 42+ messages in thread
From: Greg Hogan @ 2024-06-18 17:31 UTC (permalink / raw)
  To: MSavoritias; +Cc: guix-devel

On Tue, Jun 18, 2024 at 12:33 PM MSavoritias <email@msavoritias.me> wrote:
>
> Ah it seems I wasn't clear enough.
> I meant write something like:
>
> By packaging a software project for Guix you are exposing said software
> to a code harvesting project (also known as LLMs or "AI") run by
> Software Heritage and/or their partners. Make sure you have gotten
> fully informed consent and that the author of this package fully
> understands what the implications are.
>
> Something like that. To make it clear that the package that is about to
> be added to Guix is going to be harvested for the LLM models Software
> Heritage decided to share the code with.
>
> Hope this is more clear.

Free software licenses do not require bespoke consent "to run the
program, to study and change the program in source code form, to
redistribute exact copies, and to distribute modified versions" (and
"Being free to do these things means (among other things) that you do
not have to ask or pay for permission to do so.").

Your fear mongering against free software runs afoul of Guix project
guidelines ("In addition, the GNU distribution follow [sic] the free
software distribution guidelines. Among other things, these guidelines
reject non-free firmware, recommendations of non-free software, and
discuss ways to deal with trademarks and patents.").

If you feel that LLMs/AI are violating the terms of a license, then
feel free to pursue that through the legal system (potentially very
profitable given the monetary penalties for violations of copyright).
Otherwise, we should be celebrating the users and use of free
software. I'm old enough to remember "Only wimps use tape backup:
_real_ men just upload their important stuff on ftp, and let the rest
of the world mirror it ;)"
[https://lkml.iu.edu/hypermail/linux/kernel/9607.2/0292.html].



* Re: Next Steps For the Software Heritage Problem
  2024-06-18 17:31     ` Greg Hogan
@ 2024-06-18 17:57       ` Ian Eure
  2024-06-19  7:01       ` MSavoritias
  1 sibling, 0 replies; 42+ messages in thread
From: Ian Eure @ 2024-06-18 17:57 UTC (permalink / raw)
  To: guix-devel

Hi Greg,

Please read my earlier reply in this thread[1].

HuggingFace is demonstrably violating the licenses of the Free 
Software used to train its StarCoder2 LLM.

Software Heritage is continuing to partner with HuggingFace in 
spite of these violations.

Guix is continuing to partner with SWH in spite of their continued 
support of these violations.

Guix is indirectly enabling the violation of the license for the 
Free Software it packages.  Guix has the power to stop doing that. 
What is your specific rationale for continuing to enable these 
clear license violations?

Thanks,

  — Ian

[1]: 
https://lists.gnu.org/archive/html/guix-devel/2024-06/msg00195.html

Greg Hogan <code@greghogan.com> writes:

> On Tue, Jun 18, 2024 at 12:33 PM MSavoritias 
> <email@msavoritias.me> wrote:
>>
>> Ah it seems I wasn't clear enough.
>> I meant write something like:
>>
>> By packaging a software project for Guix you are exposing said 
>> software
>> to a code harvesting project (also known as LLMs or "AI") run 
>> by
>> Software Heritage and/or their partners. Make sure you have 
>> gotten
>> fully informed consent and that the author of this package 
>> fully
>> understands what the implications are.
>>
>> Something like that. To make it clear that the package that is 
>> about to
>> be added to Guix is going to be harvested for the LLM models 
>> Software
>> Heritage decided to share the code with.
>>
>> Hope this is more clear.
>
> Free software licenses do not require bespoke consent to "to run 
> the
> program, to study and change the program in source code form, to
> redistribute exact copies, and to distribute modified versions" 
> (and
> "Being free to do these things means (among other things) that 
> you do
> not have to ask or pay for permission to do so.").
>
> Your fear mongering against free software runs afoul of Guix 
> project
> guidelines ("In addition, the GNU distribution follow [sic] the 
> free
> software distribution guidelines. Among other things, these 
> guidelines
> reject non-free firmware, recommendations of non-free software, 
> and
> discuss ways to deal with trademarks and patents.").
>
> If you feel that LLMs/AI are violating the terms of a license, 
> then
> feel free to pursue that through the legal system (potentially 
> very
> profitable given the monetary penalties for violations of 
> copyright).
> Otherwise, we should be celebrating the users and use of free
> software. I'm old enough to remember "Only wimps use tape 
> backup:
> _real_ men just upload their important stuff on ftp, and let the 
> rest
> of the world mirror it ;)"
> [https://lkml.iu.edu/hypermail/linux/kernel/9607.2/0292.html].




* Re: Next Steps For the Software Heritage Problem
  2024-06-18 17:12 Andy Tai
@ 2024-06-18 18:08 ` Ian Eure
  2024-06-19 10:31   ` raingloom
  2024-06-27 12:27   ` Ludovic Courtès
  0 siblings, 2 replies; 42+ messages in thread
From: Ian Eure @ 2024-06-18 18:08 UTC (permalink / raw)
  To: guix-devel

Guix sends archive requests to SWH.  SWH gives that source code to 
HuggingFace.  HuggingFace demonstrably violates the licenses.

Guix could stop sending archive requests to SWH.  This wouldn’t 
*stop* the bad things from happening, but it would *stop 
condoning* them.  The same as how Guix not allowing non-free 
software doesn’t stop people from running it, but doesn’t condone 
it.

Please read my replies in this thread, and the earlier 
"Concerns/questions around Software Heritage Archive" one.  I have 
outlined the situation, repeatedly, with references.

Thanks,

  — Ian

Andy Tai <atai@atai.org> writes:

> What is the role of GNU Guix in this? If Guix is mainly a 
> referral
> mechanism like web page links to the actual contents, the real 
> problem
> is not Guix but the use of free software which can be obtained 
> via
> other mechanisms directly anyway to train LLMs if Guix is not in 
> the
> loop?




* Re: Next Steps For the Software Heritage Problem
  2024-06-18 17:31     ` Greg Hogan
  2024-06-18 17:57       ` Ian Eure
@ 2024-06-19  7:01       ` MSavoritias
  2024-06-19  9:57         ` Efraim Flashner
  2024-06-20  2:56         ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  1 sibling, 2 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-19  7:01 UTC (permalink / raw)
  To: Greg Hogan; +Cc: guix-devel

On Tue, 18 Jun 2024 13:31:02 -0400
Greg Hogan <code@greghogan.com> wrote:

> On Tue, Jun 18, 2024 at 12:33 PM MSavoritias <email@msavoritias.me>
> wrote:
> >
> > Ah it seems I wasn't clear enough.
> > I meant write something like:
> >
> > By packaging a software project for Guix you are exposing said
> > software to a code harvesting project (also known as LLMs or "AI")
> > run by Software Heritage and/or their partners. Make sure you have
> > gotten fully informed consent and that the author of this package
> > fully understands what the implications are.
> >
> > Something like that. To make it clear that the package that is
> > about to be added to Guix is going to be harvested for the LLM
> > models Software Heritage decided to share the code with.
> >
> > Hope this is more clear.  
> 
> Free software licenses do not require bespoke consent to "to run the
> program, to study and change the program in source code form, to
> redistribute exact copies, and to distribute modified versions" (and
> "Being free to do these things means (among other things) that you do
> not have to ask or pay for permission to do so.").
> 
> Your fear mongering against free software runs afoul of Guix project
> guidelines ("In addition, the GNU distribution follow [sic] the free
> software distribution guidelines. Among other things, these guidelines
> reject non-free firmware, recommendations of non-free software, and
> discuss ways to deal with trademarks and patents.").
> 
> If you feel that LLMs/AI are violating the terms of a license, then
> feel free to pursue that through the legal system (potentially very
> profitable given the monetary penalties for violations of copyright).
> Otherwise, we should be celebrating the users and use of free
> software. I'm old enough to remember "Only wimps use tape backup:
> _real_ men just upload their important stuff on ftp, and let the rest
> of the world mirror it ;)"
> [https://lkml.iu.edu/hypermail/linux/kernel/9607.2/0292.html].

Hey Greg,

You seem to be arguing on a different thread or a point I never made. I
didn't talk about licenses or legal/state rules before you mentioned
them. What I have mentioned is that SH breaks our social rules and
expectations by feeding all code into an algorithm that will endlessly
output the same as the original.

I am not interested in what the states or licenses/copyrights allow or
don't allow in this case. What I care about is what we expect as a
community when we submit a package/code to guix and if that violates
our social rules and expectations. And from what I have seen and talked
with people it does indeed.

PS. I am also not a man :P

Regards,
MSavoritias




* Re: Next Steps For the Software Heritage Problem
@ 2024-06-19  7:52 Simon Tournier
  2024-06-19  9:13 ` MSavoritias
  0 siblings, 1 reply; 42+ messages in thread
From: Simon Tournier @ 2024-06-19  7:52 UTC (permalink / raw)
  To: Ian Eure, guix-devel

Hi Ian, all,

On Tue, 18 Jun 2024 at 10:57, Ian Eure <ian@retrospec.tv> wrote:

> Guix is continuing to partner with SWH in spite of their continued 
> support of these violations.

Quickly because I am in the middle of a busy day. :-)

I think that LLMs raise ethical and legal questions to which even the
FSF, EFF, or SFC do not provide clear answers.  (And that is probably
the level where the discussion should happen.)  That’s not a light topic
and we should not rush to one definitive conclusion.

Thank you for raising the concern some weeks ago.  It seems good to me
that people have expressed their concerns, and still do.  Although here
and there I am reading an aggressive tone, which is useless.

Again, the people behind SWH are long-term free software activists, and
rest assured that they do not take this concern lightly.  FYI, people
from SWH are in touch with some people from Guix to speak about all
that.


1. Legal.

These license violations are your interpretation of the law and, to my
knowledge, nothing has gone to court yet.

Today, it does not really matter if we (or I) share this opinion.
Because for now, it’s just an opinion.

However, no one is a lawyer here and drawing a clear line is not simple.

Thus, FWIW, I would not jump to hard conclusions based on my own opinion
because today I am not confident enough to state a definitive legal
position.


2. Ethical.

If we speak about ethical concerns, we need to be very cautious.  We all
share the same core values about free software.  But we do not all
draw the boundary of these values at the same point.  Some of us extend
them to other topics; others restrict them a bit.

Here the issue is that values other than the ones about free software
are dragged into the picture to take a position.  That’s where we need
to be cautious, because we need to embrace diversity and not morally
judge what is outside our free software project.

About SWH, FWIW, here is my moral reasoning; as you see, it is far from
definitive.

I think that LLM/AI is morally bad in the context of climate change; a
moral value outside free software, BTW.  By extension, HuggingFace
appears to me morally bad.

Then, is SWH morally bad because they formed a partnership with
HuggingFace?  Is it morally bad to help SWH in harvesting source code?
Well, the answers are not obvious to me.

An analogy could be: Am I morally bad when I use my Github account to
report bugs in free software there?  Or when I contribute to free
software hosted on Github?  Let’s not drift; I am just trying to show
that moral questions are often more complex than yes or no.

Not everything is 0 or 1.  There are tradeoffs and balance.

Back to SWH.  I consider that free software source code is part of human
culture and it must be preserved. Preserving source code is morally
good.

Thus, I think the mission of SWH is morally good, because their
partnership with UNESCO to collect and preserve this human culture is
morally good.  So helping in that mission appears to me morally good.

Moreover, being able to rescue source code is also morally good.  For
example, in a scientific context, where trust in scientific knowledge
depends on software that vanishes.  This trust appears to me vitally
important.

Therefore, it appears to me very harsh to jump to a definitive moral
conclusion about the SWH initiative.


All that said, back to my busy day. :-)

Cheers,
simon



* Re: Next Steps For the Software Heritage Problem
  2024-06-18 14:19 ` Ian Eure
@ 2024-06-19  8:36   ` Dale Mellor
  2024-06-20 17:00     ` Andreas Enge
  0 siblings, 1 reply; 42+ messages in thread
From: Dale Mellor @ 2024-06-19  8:36 UTC (permalink / raw)
  To: Ian Eure, guix-devel

On Tue, 2024-06-18 at 07:19 -0700, Ian Eure wrote:
> Hi MSavoritias,
> 
> Thank you for the email.
> 
> I’m going to lay out this situation as clearly as I can, in the 
> hope that others will better understand, and hopefully treat it 
> with the seriousness it deserves.
> 
> 1. Guix requests SWH to archive some source code.  This is fine.

  No, it's not.  I use Guix as a tool to develop my own projects, private and
personal for reasons I'm keeping to myself.  As part of that I write package
definitions for them, and use the Guix machinery to build and test.  I *cannot*
have Guix just giving my code away to anybody, that is just fundamentally wrong.

  We need to ask what is Guix?  A free operating system, a framework for
developing free operating systems, or a more generic tool for software
development and deployment?  If the latter it *cannot* do nefarious things
without explicit consent.

  I think at least there should be a /restricted/ license type available to
package definitions, and the system absolutely should not give source code away
from packages which use this (of course, they won't get into the official
distribution, but that's fine).
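
A restricted license type of this kind could act as a simple filter in
front of any archival step. The sketch below is only an illustration:
the `restricted` tag, the dictionary layout, and the `archivable` helper
are assumptions made up for the example, not existing Guix features.

```python
RESTRICTED = "restricted"  # hypothetical license tag, not an existing Guix license


def archivable(packages):
    """Return only the packages whose source may be sent to an archive."""
    return [p for p in packages if p["license"] != RESTRICTED]


packages = [
    {"name": "hello", "license": "gpl3+"},
    {"name": "my-private-tool", "license": RESTRICTED},
]
to_archive = [p["name"] for p in archivable(packages)]
```

With a check like this, a privately developed package would never be
handed to an archive, whatever else the tooling does with it.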

  More broadly, I think they should just stop inter-operating with SH.  Just
walk away.

Dale




* Re: Next Steps For the Software Heritage Problem
  2024-06-19  7:52 Next Steps For the Software Heritage Problem Simon Tournier
@ 2024-06-19  9:13 ` MSavoritias
  2024-06-19  9:54   ` Efraim Flashner
  2024-06-19 14:41   ` Simon Tournier
  0 siblings, 2 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-19  9:13 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Ian Eure, guix-devel

On Wed, 19 Jun 2024 09:52:36 +0200
Simon Tournier <zimon.toutoune@gmail.com> wrote:

> Hi Ian, all,
> 
> On Tue, 18 Jun 2024 at 10:57, Ian Eure <ian@retrospec.tv> wrote:
> 
> > Guix is continuing to partner with SWH in spite of their continued 
> > support of these violations.  
> 
> Quickly because I am in the middle of a busy day. :-)

Hey Simon,

> 
> I think that LLM asks ethical and legal question that even FSF or EFF
> or SFC does not provide clear answers.  (And that probably the level
> where the discussion should happen.)  That’s not a light topic and we
> should not rush in one definitive conclusion.
> 
> Thank you for the rise of the concern some weeks ago.  It appears to
> me good that people had expressed their concerns.  And still does.
> Although I am reading there or overthere an aggressive tone; useless.
> 
> Again, people behind SWH are long-term free software activists and be
> sure that they do not take this concern lightly.  FYI, people of SWH
> are in touch with some people from Guix to speak about all that.

That is a very good point actually, and it is one I also raised in the
email I sent: we have been told there are some discussions, but we
haven't seen any results for over 6 months now. Hence my asking for
anybody that has approached SH in an official Guix capacity to step
forward. Otherwise, as I said, I can approach SH :)

> 
> 1. Legal.
> 
> These license violations are your interpretation of the law and to my
> knowledge nothing have been in Court, yet.
> 
> Today, it does not really matter if we (or I) share this opinion.
> Because for now, it’s just an opinion.
> 
> However, no one is a lawyer here and drawing a clear line is not
> simple.
> 
> Thus, FWIW, I would not jump in hard conclusions based on my own
> opinion because today I am not confidant enough to emit a definitive
> legal position.
> 

That is fair, I agree that copyright wise and legal/state wise the
answer is not clear at all. And I don't think anybody in this mailing
list can decisively answer that, as you said.

> 2. Ethical.
> 
> If we speak about ethical concerns, we need to be very cautious.  We
> all share the same core of values about free software.  Then we all
> do not bound these values to the same point.  Some of us extend them
> to some topics, other restrict a bit.
> 
> Here the issue is that other values than the ones about free software
> are dragged in the picture to emit a position.  That’s where we need
> to be cautious because we need to embrace the diversity and do not
> morally judge what is outside our free software project.
> 
> About SWH, FWIW, here is my moral reasoning; as you see, it is far to
> be definitive.

I agree that we probably won't find any definitive answer if LLMs are
bad or not. But that is also not the question posed here tho.

The question posed here was that *all* code that is sent from Guix to
SH is automatically transferred without consent to be used in an LLM
model. That is without said process being opt-in and without said
process being transparent.

The second one could be solved by adding the disclaimer and making the
changes to commit packages as I said. It can also be done, I was told,
by just stopping guix from uploading any new code to SH from any
package, which I would also be in favor of.
The first one can be done with social pressure, which is what the
blogpost, the talking, and potentially not including SH in Guix
go towards.

Whether LLMs are ethical or not has nothing to do with the question
posed above. Although personally I would push for not including LLMs
unless under strict criteria of environmental and ethical sourcing. But
that can come at a later time.

I would also like SH to see why opt-in should be the default at the
very least, and the process should be transparent to everybody putting
code into SH. Archiving source code is a good cause. This is why
I said to approach them in official Guix capacity :)

MSavoritias




* Re: Next Steps For the Software Heritage Problem
  2024-06-19  9:13 ` MSavoritias
@ 2024-06-19  9:54   ` Efraim Flashner
  2024-06-19 10:25     ` raingloom
  2024-06-19 10:34     ` MSavoritias
  2024-06-19 14:41   ` Simon Tournier
  1 sibling, 2 replies; 42+ messages in thread
From: Efraim Flashner @ 2024-06-19  9:54 UTC (permalink / raw)
  To: MSavoritias; +Cc: Simon Tournier, Ian Eure, guix-devel


On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
> On Wed, 19 Jun 2024 09:52:36 +0200
> Simon Tournier <zimon.toutoune@gmail.com> wrote:
> 
> > Hi Ian, all,
> > 
> > On Tue, 18 Jun 2024 at 10:57, Ian Eure <ian@retrospec.tv> wrote:
> > 
> > > Guix is continuing to partner with SWH in spite of their continued 
> > > support of these violations.  
> > 
> > Quickly because I am in the middle of a busy day. :-)
> 
> Hey Simon,
> 
> > 
> > I think that LLM asks ethical and legal question that even FSF or EFF
> > or SFC does not provide clear answers.  (And that probably the level
> > where the discussion should happen.)  That’s not a light topic and we
> > should not rush in one definitive conclusion.
> > 
> > Thank you for the rise of the concern some weeks ago.  It appears to
> > me good that people had expressed their concerns.  And still does.
> > Although I am reading there or overthere an aggressive tone; useless.
> > 
> > Again, people behind SWH are long-term free software activists and be
> > sure that they do not take this concern lightly.  FYI, people of SWH
> > are in touch with some people from Guix to speak about all that.
> 
> That is a very good point actually and it is one I also raised in the
> email I sent. That we have been told there are some discussions but we
> haven't seen any results for over 6 months now. Hence me asking for
> anybody that has approached SH in an official Guix capacity to step
> forward. Otherwise as I said I can approach SH :)

The relationship between SWH and Hugging Face is (IMO) off-topic for the
Guix mailing lists.  I'm not surprised that the discussions are
happening elsewhere.

> > 
> > 1. Legal.
> > 
> > These license violations are your interpretation of the law and to my
> > knowledge nothing have been in Court, yet.
> > 
> > Today, it does not really matter if we (or I) share this opinion.
> > Because for now, it’s just an opinion.
> > 
> > However, no one is a lawyer here and drawing a clear line is not
> > simple.
> > 
> > Thus, FWIW, I would not jump in hard conclusions based on my own
> > opinion because today I am not confidant enough to emit a definitive
> > legal position.
> > 
> 
> That is fair, I agree that copyright wise and legal/state wise the
> answer is not clear at all. And I don't think anybody in this mailing
> list can decidely answer that as you said.
> 
> > 2. Ethical.
> > 
> > If we speak about ethical concerns, we need to be very cautious.  We
> > all share the same core of values about free software.  Then we all
> > do not bound these values to the same point.  Some of us extend them
> > to some topics, other restrict a bit.
> > 
> > Here the issue is that other values than the ones about free software
> > are dragged in the picture to emit a position.  That’s where we need
> > to be cautious because we need to embrace the diversity and do not
> > morally judge what is outside our free software project.
> > 
> > About SWH, FWIW, here is my moral reasoning; as you see, it is far to
> > be definitive.
> 
> I agree that we probably won't find any definitive answer if LLMs are
> bad or not. But that is also not the question posed here tho.
> 
> The question posed here was that *all* code that is sent from Guix to
> SH is automatically transfered without consent to be used in an LLM
> model. That is without said process being opt-in and without said
> process being transparent.

I am not a lawyer, nor do I play one on TV.

Transferring the code is (legally) fine, using the code is (legally)
fine, distributing the result is (I think) legally questionable.

If your concern is the code being transferred to the LLM owners, IMO
that's already covered by the license of the code itself. As for what
the LLM owners do with the code, (again I am not a lawyer) it should not
make a difference if SWH gives them the code, they download it from
Guix's infrastructure or get it straight from upstream. Redistributing
the source code is allowed.

> The second one could be solved by adding the disclaimer and making the
> changes to commit packages as a i said. It can also be done I was told
> by just stopping guix from uploading any new code to SH from any
> package. which I would also be in favor.
> The first one can be done with social pressure which is what the
> blogpost and the talking and potentially the not including SH into Guix
> go towards.
> 
> Whether LLMs are ethical or not has nothing to do with the question
> posted above. Although personally I would push for not including LLMs
> unless under strict criteria of environmental and ethical sourcing. but
> that can come at a later time.
> 
> I would also like SH to see why opt-in should be the default at the
> very least, and the process should be transparent to everybody putting
> code into SH. Archiving source code is a good cause. This is why
> I said to approach them in official Guix capacity :)

One of our packages, dbxfs, left Github a while ago and continued
development on a different forge. They adjusted their README to disallow
hosting of their code on Github. Based on this restriction we have
labeled later versions of the software as non-free and have not updated
the package. IMO saying that source code cannot be uploaded to SWH would
fall into the same category.

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  7:01       ` MSavoritias
@ 2024-06-19  9:57         ` Efraim Flashner
  2024-06-20  2:56         ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  1 sibling, 0 replies; 42+ messages in thread
From: Efraim Flashner @ 2024-06-19  9:57 UTC (permalink / raw)
  To: MSavoritias; +Cc: Greg Hogan, guix-devel


On Wed, Jun 19, 2024 at 10:01:43AM +0300, MSavoritias wrote:
> On Tue, 18 Jun 2024 13:31:02 -0400
> Greg Hogan <code@greghogan.com> wrote:
> 
> > On Tue, Jun 18, 2024 at 12:33 PM MSavoritias <email@msavoritias.me>
> > wrote:
> > >
<snip>
> > 
> > If you feel that LLMs/AI are violating the terms of a license, then
> > feel free to pursue that through the legal system (potentially very
> > profitable given the monetary penalties for violations of copyright).
> > Otherwise, we should be celebrating the users and use of free
> > software. I'm old enough to remember "Only wimps use tape backup:
> > _real_ men just upload their important stuff on ftp, and let the rest
> > of the world mirror it ;)"
> > [https://lkml.iu.edu/hypermail/linux/kernel/9607.2/0292.html].
> 
> Hey Greg,
> 
<snip>
> 
> PS. I am also not a man :P

To head off any potential misunderstanding, I followed the link above
and the line "Only wimps ..." is an old quote from Linus Torvalds, not
Greg assuming your gender :).

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted



* Re: Next Steps For the Software Heritage Problem
  2024-06-18  8:37 MSavoritias
  2024-06-18 14:19 ` Ian Eure
  2024-06-18 16:21 ` Greg Hogan
@ 2024-06-19 10:10 ` Efraim Flashner
  2 siblings, 0 replies; 42+ messages in thread
From: Efraim Flashner @ 2024-06-19 10:10 UTC (permalink / raw)
  To: MSavoritias; +Cc: guix-devel


On Tue, Jun 18, 2024 at 11:37:17AM +0300, MSavoritias wrote:
> Hello,
<snip>
> So with that said I urge anybody who has been in contact with them in
> an official Guix capacity to come forward, otherwise I can volunteer to
> be that. Idk if we have a community outreach thing I need to be in also
> for that. (we should if not)
<snip>

Without addressing the rest of the email, I'd like to point out that if
the Guix project needs to interact with SWH (or Hugging Face) in an
official capacity then the maintainers will either do it or take care of
it. Thank you for your offer, we'll keep it in mind.

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  9:54   ` Efraim Flashner
@ 2024-06-19 10:25     ` raingloom
  2024-06-19 15:46       ` Ekaitz Zarraga
  2024-06-19 10:34     ` MSavoritias
  1 sibling, 1 reply; 42+ messages in thread
From: raingloom @ 2024-06-19 10:25 UTC (permalink / raw)
  To: MSavoritias, Simon Tournier, Ian Eure, guix-devel

On 2024-06-19 11:54, Efraim Flashner wrote:
> On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
> ...
> One of our packages, dbxfs, left Github a while ago and continued
> development on a different forge. They adjusted their README to disallow
> hosting of their code on Github. Based on this restriction we have
> labeled later versions of the software as non-free and have not updated
> the package. IMO saying that source code cannot be uploaded to SWH would
> fall into the same category.

No wonder more and more people are growing dissatisfied with the free
software movement.



* Re: Next Steps For the Software Heritage Problem
  2024-06-18 18:08 ` Ian Eure
@ 2024-06-19 10:31   ` raingloom
  2024-06-27 12:27   ` Ludovic Courtès
  1 sibling, 0 replies; 42+ messages in thread
From: raingloom @ 2024-06-19 10:31 UTC (permalink / raw)
  To: Ian Eure; +Cc: guix-devel

On 2024-06-18 20:08, Ian Eure wrote:
> Andy Tai <atai@atai.org> writes:
> 
>> What is the role of GNU Guix in this? If Guix is mainly a referral
>> mechanism like web page links to the actual contents, the real problem
>> is not Guix but the use of free software which can be obtained via
>> other mechanisms directly anyway to train LLMs if Guix is not in the
>> loop?

> Guix sends archive requests to SWH.  SWH gives that source code to HuggingFace.  HuggingFace demonstrably violates the licenses.
> 
> Guix could stop sending archive requests to SWH.  This wouldn’t *stop* the bad things from happening, but it would *stop condoning* them.  The same as how Guix not allowing non-free software doesn’t stop people from running it, but doesn’t condone it.
> ...

Guix doesn't just condone it in this case, it's actively helping SWH out
by submitting packages.



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  9:54   ` Efraim Flashner
  2024-06-19 10:25     ` raingloom
@ 2024-06-19 10:34     ` MSavoritias
  1 sibling, 0 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-19 10:34 UTC (permalink / raw)
  To: Efraim Flashner; +Cc: MSavoritias, Simon Tournier, Ian Eure, guix-devel

On Wed, 19 Jun 2024 12:54:30 +0300
Efraim Flashner <efraim@flashner.co.il> wrote:

> On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
> > On Wed, 19 Jun 2024 09:52:36 +0200
> > Simon Tournier <zimon.toutoune@gmail.com> wrote:
> >   
> > > Hi Ian, all,
> > > 
> > > On Tue, 18 Jun 2024 at 10:57, Ian Eure <ian@retrospec.tv> wrote:
>
> > > I think that LLM asks ethical and legal question that even FSF or
> > > EFF or SFC does not provide clear answers.  (And that probably
> > > the level where the discussion should happen.)  That’s not a
> > > light topic and we should not rush in one definitive conclusion.
> > > 
> > > Thank you for the rise of the concern some weeks ago.  It appears
> > > to me good that people had expressed their concerns.  And still
> > > does. Although I am reading there or overthere an aggressive
> > > tone; useless.
> > > 
> > > Again, people behind SWH are long-term free software activists
> > > and be sure that they do not take this concern lightly.  FYI,
> > > people of SWH are in touch with some people from Guix to speak
> > > about all that.  
> > 
> > That is a very good point actually and it is one I also raised in
> > the email I sent. That we have been told there are some discussions
> > but we haven't seen any results for over 6 months now. Hence me
> > asking for anybody that has approached SH in an official Guix
> > capacity to step forward. Otherwise as I said I can approach SH :)  
> 
> The relationship between SWH and Hugging Face is (IMO) off-topic for
> the Guix mailing lists.  I'm not surprised that the discussions are
> happening elsewhere.

Given that any code and package that is contributed to Guix goes to SWH
and Hugging Face I would disagree.

> > > 2. Ethical.
> > > 
> > > If we speak about ethical concerns, we need to be very cautious.
> > > We all share the same core of values about free software.  Then
> > > we all do not bound these values to the same point.  Some of us
> > > extend them to some topics, other restrict a bit.
> > > 
> > > Here the issue is that other values than the ones about free
> > > software are dragged in the picture to emit a position.  That’s
> > > where we need to be cautious because we need to embrace the
> > > diversity and do not morally judge what is outside our free
> > > software project.
> > > 
> > > About SWH, FWIW, here is my moral reasoning; as you see, it is
> > > far to be definitive.  
> > 
> > I agree that we probably won't find any definitive answer if LLMs
> > are bad or not. But that is also not the question posed here tho.
> > 
> > The question posed here was that *all* code that is sent from Guix
> > to SH is automatically transfered without consent to be used in an
> > LLM model. That is without said process being opt-in and without
> > said process being transparent.  
> 
> I am not a lawyer, nor do I play one on TV.
> 
> Transferring the code is (legally) fine, using the code is (legally)
> fine, distributing the result is (I think) legally questionable.
> 
> If your concern is the code being transferred to the LLM owners, IMO
> that's already covered by the license of the code itself. As for what
> the LLM owners do with the code, (again I am not a lawyer) it should
> not make a difference if SWH gives them the code, they download it
> from Guix's infrastructure or get it straight from upstream.
> Redistributing the source code is allowed.

Idk if you read the email that was sent to Greg in the other thread.
Given that you replied there too, I assume you did.
So given this context I am repeating that this is not about legality, and
let me copy-paste my reply to the legal argument:

Quote:
You seem to be arguing on a different thread or a point I never made. I
didn't talk about licenses or legal/state rules before you mentioned
them. What I have mentioned is that SH breaks our social rules and
expectations by feeding all code into an algorithm that will endlessly
output the same as the original.

I am not interested in what the states or licenses/copyrights allow or
don't allow in this case. What I care about is what we expect as a
community when we submit a package/code to guix and if that violates
our social rules and expectations. And from what I have seen and from
talking with people, it does indeed.

> > The second one could be solved by adding the disclaimer and making
> > the changes to commit packages as a i said. It can also be done I
> > was told by just stopping guix from uploading any new code to SH
> > from any package. which I would also be in favor.
> > The first one can be done with social pressure which is what the
> > blogpost and the talking and potentially the not including SH into
> > Guix go towards.
> > 
> > Whether LLMs are ethical or not has nothing to do with the question
> > posted above. Although personally I would push for not including
> > LLMs unless under strict criteria of environmental and ethical
> > sourcing. but that can come at a later time.
> > 
> > I would also like SH to see why opt-in should be the default at the
> > very least, and the process should be transparent to everybody
> > putting code into SH. Archiving source code is a good cause. This
> > is why I said to approach them in official Guix capacity :)  
> 
> One of our packages, dbxfs, left Github a while ago and continued
> development on a different forge. They adjusted their README to
> disallow hosting of their code on Github. Based on this restriction
> we have labeled later versions of the software as non-free and have
> not updated the package. IMO saying that source code cannot be
> uploaded to SWH would fall into the same category.

Good thing that is not what I suggested then. :)

Regards,
MSavoritias



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  9:13 ` MSavoritias
  2024-06-19  9:54   ` Efraim Flashner
@ 2024-06-19 14:41   ` Simon Tournier
  2024-06-20  6:51     ` MSavoritias
  1 sibling, 1 reply; 42+ messages in thread
From: Simon Tournier @ 2024-06-19 14:41 UTC (permalink / raw)
  To: MSavoritias; +Cc: Ian Eure, guix-devel

Hi MSavoritias, all,

Let me provide more context.

The concern started a couple of months ago, to my knowledge.  And the
discussion is still ongoing.  So I think it is incorrect to say there
has not been “any result for over 6 months”.

Moreover, I feel you have a misunderstanding about the HuggingFace and
SWH partnership.  From my reading of public information, HuggingFace and
BigCode train on a subset of the SWH source code archive.  I mean, it is
a snapshot and, to my knowledge, they provided the list of source code
that had been used for training.

Not to avoid the question, but from a pragmatic point of view, one might
ask whether the source code you write, and do not want included in the
training dataset, is actually part of that training dataset.

HuggingFace is not training continuously with source code from SWH.

And technically, SWH is an archive, i.e., the code is not stored hot.  I
do not know, and have not read, all the details HuggingFace gives about
their method, i.e., which kind of data they process – independent unique
files, complete repositories, etc.  What I know is that the component
used when fetching from SWH is named SWH Vault; it needs to “cook” and
prepare all the files, which takes time, from minutes to days.


All that to say two key points:

1. People behind SWH are well aware of the various sides of the concerns.
As said, they are long-time free software supporters.  Be sure they have
heard the community's concerns.  Some discussions are still pending
because, as explained, all sides of the ethical questions need to be
handled with caution.

Please do not think it is ignored.


2. FWIW, I am in touch with SWH people – among other members from the Guix
community.  For instance, in order to feed the discussion, Roberto from
SWH pointed me to this blog post by Bruce Perens:

    https://perens.com/2019/10/12/invasion-of-the-ethical-licenses/

Well, I do not know if the outcome will be aligned with your current
opinion, but be sure that your concerns, like the others raised by Guix
community members, are being taken into account.


Cheers,
simon



* Re: Next Steps For the Software Heritage Problem
  2024-06-19 10:25     ` raingloom
@ 2024-06-19 15:46       ` Ekaitz Zarraga
  2024-06-20  6:36         ` MSavoritias
  0 siblings, 1 reply; 42+ messages in thread
From: Ekaitz Zarraga @ 2024-06-19 15:46 UTC (permalink / raw)
  To: raingloom, MSavoritias, Simon Tournier, Ian Eure, guix-devel

On 2024-06-19 12:25, raingloom@riseup.net wrote:
> On 2024-06-19 11:54, Efraim Flashner wrote:
>> On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
>> ...
>> One of our packages, dbxfs, left Github a while ago and continued
>> development on a different forge. They adjusted their README to disallow
>> hosting of their code on Github. Based on this restriction we have
>> labeled later versions of the software as non-free and have not updated
>> the package. IMO saying that source code cannot be uploaded to SWH would
>> fall into the same category.
> 
> No wonder more and more people are growing dissatisfied with the free
> software movement.
> 

There are many valid reasons why someone might criticize the Free 
Software movement and people behind it, but making free software only 
has 4 simple rules. If you don't comply with them you are not free 
software anymore. It's as simple as that, and that simple it should be.

Free Software gives me the FREEDOM to print the code, make a roll with 
it and shove it up my ass if I want to (and even distribute my modified 
copies for other people to do so). The same freedom I have to upload it 
to github. If you prevent me from doing one or the other you are 
restricting my freedom and that's defeating the purpose of free software 
and we cannot consider your code free software anymore. The line is 
clear, and trying to pretend to be free software while restricting 
people's freedoms (regardless of what they are) is absurd.

The Free Software movement can be labeled (and is often labeled) as a 
political movement but I'd say it's more of an ethical movement. It's a 
way to share *values* and the value we share here is freedom. We might 
or might not share other values, politics, religion or anything, but as 
long as we put the freedom in the first place we should agree that free 
software is better than any other software model we have.

There are bad actors in the world (say thieves, killers or... GitHub and
AI), and we can discuss how we should deal with them, but I don't
think the answer is putting our *values* aside but embracing them harder
(one value, freedom, in our case).

If people are not happy with the Free Software movement because it puts
freedom first, I can only understand it as people being mad about
Free Software because it's about software.

For other values, we can start other initiatives I may or may not agree 
more with, but if the value is freedom (in software), I don't think 
there's any better way to push for it. But trying to disguise other 
things inside of the Free Software is kind of dishonest.

I don't know, maybe I'm just a little bit tired.



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  7:01       ` MSavoritias
  2024-06-19  9:57         ` Efraim Flashner
@ 2024-06-20  2:56         ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  2024-06-20  5:18           ` MSavoritias
  1 sibling, 1 reply; 42+ messages in thread
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-06-20  2:56 UTC (permalink / raw)
  To: MSavoritias, Greg Hogan; +Cc: guix-devel

Hi MSavoritias,

On Wed, Jun 19 2024, MSavoritias wrote:

> I am not interested what the states or licenses/copyrights allow or
> don't allow in this case. What I care about is what we expect as a
> community when we submit a package/code to guix and if that violates
> our social rules and expectations.

Just in case the sweeping mention of our social rules and expectations
includes me, please know that licensing and copyright are a big part of
why I am a part of this community.

Kind regards
Felix



* Re: Next Steps For the Software Heritage Problem
  2024-06-20  2:56         ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
@ 2024-06-20  5:18           ` MSavoritias
  0 siblings, 0 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-20  5:18 UTC (permalink / raw)
  To: Felix Lechner; +Cc: Greg Hogan, guix-devel

On Wed, 19 Jun 2024 19:56:26 -0700
Felix Lechner <felix.lechner@lease-up.com> wrote:

> Hi MSavoritias,
> 
> On Wed, Jun 19 2024, MSavoritias wrote:
> 
> > I am not interested what the states or licenses/copyrights allow or
> > don't allow in this case. What I care about is what we expect as a
> > community when we submit a package/code to guix and if that violates
> > our social rules and expectations.  
> 
> Just in case the sweeping mention of our social rules and expectations
> includes me, please know that licensing and copyright are a big part
> of why I am a part of this community.
> 
> Kind regards
> Felix

Sure, we all are.
But remember that we also have a CoC and social rules because building
a community can't be done on top of legal rules, i.e. copyright.
Just as social rules shouldn't be used for legal matters all the
time, copyright shouldn't be used for social rules. Which is what I am
saying here.

MSavoritias



* Re: Next Steps For the Software Heritage Problem
  2024-06-19 15:46       ` Ekaitz Zarraga
@ 2024-06-20  6:36         ` MSavoritias
  2024-06-20 14:35           ` Ekaitz Zarraga
  0 siblings, 1 reply; 42+ messages in thread
From: MSavoritias @ 2024-06-20  6:36 UTC (permalink / raw)
  To: Ekaitz Zarraga
  Cc: raingloom, MSavoritias, Simon Tournier, Ian Eure, guix-devel

On Wed, 19 Jun 2024 17:46:08 +0200
Ekaitz Zarraga <ekaitz@elenq.tech> wrote:

> On 2024-06-19 12:25, raingloom@riseup.net wrote:
> > On 2024-06-19 11:54, Efraim Flashner wrote:  
> >> On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
> >> ...
> >> One of our packages, dbxfs, left Github a while ago and continued
> >> development on a different forge. They adjusted their README to
> >> disallow hosting of their code on Github. Based on this
> >> restriction we have labeled later versions of the software as
> >> non-free and have not updated the package. IMO saying that source
> >> code cannot be uploaded to SWH would fall into the same category.  
> > 
> > No wonder more and more people are growing dissatisfied with the
> > free software movement.
> >   
> 
Hey Ekaitz,

Please remember two things in the context of all of this:
1. Guix is not a software entity but it is made of people that want a
safer, collaborative space to create things. These things may be code,
a blog post or anything else as part of guix. Even a social network
account. I am saying this because you only talked about Free Software
in your message and not about people or different contexts.
And we are talking about people here. Not code. Code is not alive.

2. You seem to imply that Free Software or code is apolitical. (in the
sense of social or state politics not) Which it is not. Nothing is.
For example Free Software is explicitly pro-capitalist and
pro-Google/big companies. I am not saying I disagree, but it's good
to keep in mind that politics always exist. And in the case

> There are many valid reasons why someone might criticize the Free 
> Software movement and people behind it, but making free software only 
> has 4 simple rules. If you don't comply with them you are not free 
> software anymore. It's as simple as that, and that simple it should
> be.
> 
> Free Software gives me the FREEDOM to print the code, make a roll
> with it and shove it up my ass if I want to (and even distribute my
> modified copies for other people to do so). The same freedom I have
> to upload it to github. If you prevent me from doing one or the other
> you are restricting my freedom and that's defeating the purpose of
> free software and we cannot consider your code free software anymore.
> The line is clear, and trying to pretend to be free software while
> restricting people's freedoms (regardless of what they are) is absurd.

This is missing the context that the GPL does indeed restrict people's
freedom to license code as they see fit. Because it was written to
further the political goals of the FSF. It is on purpose. So we are already
restricting the freedom of people to do what they want on purpose.

And let's not forget
"your freedom ends where the other person's freedom begins"
and consent of course in the issue at hand.

> 
> The Free Software movement can be labeled (and is often labeled) as a 
> political movement but I'd say it's more of an ethical movement. It's
> a way to share *values* and the value we share here is freedom. We
> might or might not share other values, politics, religion or
> anything, but as long as we put the freedom in the first place we
> should agree that free software is better than any other software
> model we have.
> 
> There are bad actors in the world (say thieves, killers or... GitHub
> and AI), and we can discuss about how we should deal with them but I
> don't think the answer is putting our *values* aside but embrace them
> harder (one value, freedom, in our case).

Definitely agree. The solution is not to embrace proprietary software or
restrict software. It's to write down some common social rules that are
rooted in consent.

> If people is not happy with the Free Software movement because it
> puts the freedom first, I can only understand it as people being mad
> about Free Software because it's about software.
> 
> For other values, we can start other initiatives I may or may not
> agree more with, but if the value is freedom (in software), I don't
> think there's any better way to push for it. But trying to disguise
> other things inside of the Free Software is kind of dishonest.

Fair. I mean we already have CoC and channel descriptions. Idk if we
have event guidelines/CoC yet but we should.

> I don't know, maybe I'm just a little bit tired.

No worries. I think it was very well said.

MSavoritias



* Re: Next Steps For the Software Heritage Problem
  2024-06-19 14:41   ` Simon Tournier
@ 2024-06-20  6:51     ` MSavoritias
  2024-06-20 14:40       ` Simon Tournier
  0 siblings, 1 reply; 42+ messages in thread
From: MSavoritias @ 2024-06-20  6:51 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Ian Eure, guix-devel

On Wed, 19 Jun 2024 16:41:33 +0200
Simon Tournier <zimon.toutoune@gmail.com> wrote:

> Hi MSavoritias, all,
> 
> Let me provide more context.
> 
> The concern started couple of months ago, to my knowledge.  And
> discussion is still on going.  So I think that’s incorrect to say “any
> result for over 6 months”.

Hey Simon,

I was speaking from the perspective of a Guix person who is not one of
the maintainers or on any of the mailing lists where these discussions
are happening. So from my side there haven't been any updates from SWH
or from Guix on either the named issue or the LLM issue.

> Moreover, I feel you have a misunderstanding about HuggingFace and SWH
> partnership.  From the reading of public information, HuggingFace and
> BigCode trains on a subset of SWH source code archive.  I mean, it is
> a snapshot and to my knowledge, they provided the list of source code
> that had been used for training.
> 
> Not to avoid the question but from a pragmatic point of view, one
> might ask if the source code you write and do not want to be included
> in the training dataset, if this source code is concretely part of
> that training dataset.
> 
> HuggingFace is not training continuously with source code from SWH.
> 
> And technically, SWH is an archive i.e., the code is not stored hot.
> I do not know and I have not read all details by HuggingFace of their
> method; i.e., which kind of data they process – independent unique
> files, complete repository, etc.  What I know is that the piece when
> fetching from SWH is named SWH Vault; it requires to “cook” and
> prepare all the files that take times, from minutes to days.

That's all fair and valid. Sadly though, SWH:
- Doesn't even mention anything on their website about what happens to
  my code and where, so there is no provenance (unless I start searching
  HuggingFace).
- The email from the director that was sent to me says explicitly that
  they don't see an issue with it being opt-out after the fact and
  embrace LLM usage. So it seems to me that it's already in there.

> All that to say two key points:
> 
> 1. People behind SWH are well-aware about various sides of the
> concerns. As said, they are long-time free software supporters.  Be
> sure they have heard community concerns.  Some discussions are still
> pending because as explained, all sides of ethical questions need to
> be handled with caution.
> 
> Please do not think it is ignored.
> 
> 
> 2. FWIW, I am in touch with SWH people – among other members from Guix
> community.  For instance, in order to feed the discussion, Roberto
> from SWH pointed me to this blog post by Bruce Perens:
> 
>     https://perens.com/2019/10/12/invasion-of-the-ethical-licenses/
> 
> Well, I do not know if the outcome will be aligned with your current
> opinion, but be sure that your concerns, like the others raised by Guix
> community members, are taken into account.

Thank you for giving me an honest and detailed answer.

I wish I could say this was encouraging but as things currently stand I
would like much more transparency about what is actually happening from
Guix and SWH. Because currently:
- The director seemed completely oblivious to any issues with LLMs or
  code harvesting without consent.
- Efraim seemed to suggest that there hasn't been any
  communication and that it's even off-topic.
- Nothing has been written from Guix or SWH publicly about it and there
  are no mechanisms in place in the short term even to mitigate some of
  these things. (Which my next steps try to fix when I make the patches
  in a few weeks)

Regards,
MSavoritias
 
> Cheers,
> simon



* Re: Next Steps For the Software Heritage Problem
  2024-06-20  6:36         ` MSavoritias
@ 2024-06-20 14:35           ` Ekaitz Zarraga
  2024-06-21  8:51             ` MSavoritias
  0 siblings, 1 reply; 42+ messages in thread
From: Ekaitz Zarraga @ 2024-06-20 14:35 UTC (permalink / raw)
  To: MSavoritias; +Cc: raingloom, Simon Tournier, Ian Eure, guix-devel

Hi,


On 2024-06-20 08:36, MSavoritias wrote:
> On Wed, 19 Jun 2024 17:46:08 +0200
> Ekaitz Zarraga <ekaitz@elenq.tech> wrote:
> 
>> On 2024-06-19 12:25, raingloom@riseup.net wrote:
>>> On 2024-06-19 11:54, Efraim Flashner wrote:
>>>> On Wed, Jun 19, 2024 at 12:13:38PM +0300, MSavoritias wrote:
>>>> ...
>>>> One of our packages, dbxfs, left Github a while ago and continued
>>>> development on a different forge. They adjusted their README to
>>>> disallow hosting of their code on Github. Based on this
>>>> restriction we have labeled later versions of the software as
>>>> non-free and have not updated the package. IMO saying that source
>>>> code cannot be uploaded to SWH would fall into the same category.
>>>
>>> No wonder more and more people are growing dissatisfied with the
>>> free software movement.
>>>    
>>
> Hey Ekaitz,
> 
> Please remember two things in the context of all of this:
> 1. Guix is not a software entity but it is made of people that want a
> safer, collaborative space to create things. These things may be code,
> a blog post or anything else as part of guix. Even a social network
> account. I am saying this because you only talked about Free Software
> in your message and not about people or different contexts.
> And we are talking about people here. Not code. Code is not alive.

I was specifically talking about the Free Software issue raised by
Efraim and the message by Raingloom. And exactly what you point out is
what I wanted to separate, as you did very well. Now we are talking about
the people and about how things affect people, and that's a different
matter I'm going to tackle below.

> 2. You seem to imply that Free Software or code is apolitical. (in the
> sense of social or state politics not) Which it is not. Nothing is.
> For example Free Software is explicitly pro-capitalist and
> pro-Google/big companies. I am not saying I disagree, but its good
> to keep in mind that politics exist and do exist always. And in the case

I'm not one of those people that think everything is politics but that's
not a debate I want to open. Free Software can be understood in many
ways. I don't think it's pro-capitalist, but pro-freedom; that
freedom affects the capitalists too, and it's a *value* they have. But
freedom is also an anarchist value, and it can be an anti-capitalist
value too; it becomes more political when you put more things around it.
The issue I was trying to point out is that Free Software attracts many
people from many different backgrounds and politics, and trying to push
for one side defeats its purpose: making people stay together because
they have some shared value.

>> There are many valid reasons why someone might criticize the Free
>> Software movement and people behind it, but making free software only
>> has 4 simple rules. If you don't comply with them you are not free
>> software anymore. It's as simple as that, and that simple it should
>> be.
>>
>> Free Software gives me the FREEDOM to print the code, make a roll
>> with it and shove it up my ass if I want to (and even distribute my
>> modified copies for other people to do so). The same freedom I have
>> to upload it to github. If you prevent me from doing one or the other
>> you are restricting my freedom and that's defeating the purpose of
>> free software and we cannot consider your code free software anymore.
>> The line is clear, and trying to pretend to be free software while
>> restricting people's freedoms (regardless of what they are) is absurd.
> 
> This is missing the context that GPL does indeed restrict people's
> freedom to license code as the see fit. Because it was written to
> further the political goals of FSF. It is on purpose. So we are already
> restricting the freedom of people to do what they want on purpose.

It does restrict your freedom but only if your goal is to restrict other
people's software freedom. I'd say the argument here was that GPL 
provides more absolute freedom in the current world than other licenses 
but I don't think the GPL was a very easy decision to make for the 
radical freedom fighters. That's why some people don't like it.

> And lets not forget
> "your freedom ends where the other persons freedom begins"
> and consent of course in the issue at hand.

Yes, but I don't think this is a matter Free Software needs to deal 
with. And my original message was around that.

Now, we should do something as a set of people that collaboratively work 
in a project. Probably not under the Free Software label, because what 
free software is is already pretty clear and well defined, but as 
something else, may that be Guix users and contributors, if we wish.

>>
>> The Free Software movement can be labeled (and is often labeled) as a
>> political movement but I'd say it's more of an ethical movement. It's
>> a way to share *values* and the value we share here is freedom. We
>> might or might not share other values, politics, religion or
>> anything, but as long as we put the freedom in the first place we
>> should agree that free software is better than any other software
>> model we have.
>>
>> There are bad actors in the world (say thieves, killers or... GitHub
>> and AI), and we can discuss about how we should deal with them but I
>> don't think the answer is putting our *values* aside but embrace them
>> harder (one value, freedom, in our case).
> 
> Definitely agree. The solution is not to embrace proprietary software or
> restrict software. It's to write down some common social rules that are
> rooted in consent.
> 
>> If people is not happy with the Free Software movement because it
>> puts the freedom first, I can only understand it as people being mad
>> about Free Software because it's about software.
>>
>> For other values, we can start other initiatives I may or may not
>> agree more with, but if the value is freedom (in software), I don't
>> think there's any better way to push for it. But trying to disguise
>> other things inside of the Free Software is kind of dishonest.
> 
> Fair. I mean we already have CoC and channel descriptions. Idk if we
> have event guidelines/CoC yet but we should.
> 
>> I don't know, maybe I'm just a little bit tired.
> 
> No worries. I think it was very well said.
> 
> MSavoritias

That was just to clarify that my point wasn't against this discussion
but to say that the decision Efraim took on dbxfs is not only correct but
the only possible decision, and that it should be.

Now in Guix, I don't feel comfortable with the fact that we are helping
people train AI that doesn't respect the licenses of our work. I'm sick
of it.

If they respected the licenses, I'd be ok with it. Since I accepted Free 
Software's social contract I'm open for anyone to use my code for any
purpose (unless they don't respect people's freedom later).

Also, even if we don't do anything about it, Guix's codebase is public, 
so they could do it anyway, regardless of SWH, so there's not much we 
can do about that.

What we *can* do is raise our concerns to SWH, motivating them to be
more strict with their collaboration with companies or with the terms of
their collaboration. It's probably better that they are on our side in
this battle than if we are alone. I think they are sensitive to this
issue so it shouldn't be hard to have a proper conversation with them
and see if we can understand better what they do, how, on which terms
and so on.

Maybe it's better that these AI companies reach our code through SWH
with a well-written contract than letting them steal it from the
internet without them having to sign anything.

I'm kind of just guessing there, but we are probably stronger that way.
Also, if we could get other distros to take part in this it would be a
great way to be stronger.

In any case, I think SWH are more than sensitive to this issue and I
think their connections might be helpful not only to restrict
HuggingFace from doing shady things but also to start pushing for
regulation of every AI company that uses our sweat for their purposes.

So, to come back to my original point: It's not the free software that 
needs to change. It's the regulation of AI companies that should, and 
the responsibility we demand from them. Legally and morally, they should
be accountable for what they do, and that's the direction I'd like to
approach this from. Maybe it's not easy to change the regulation of the whole
world, but we can try to push for it in Europe (we pioneered some 
related regulations before) first.

In summary, I don't think this is just a SWH is bad/good or Free 
Software is bad/good issue.

Best,
Ekaitz

PS: If there's action I'm open and ready for it, but I wouldn't like this
discussion to become an exercise in ethical bragging with no goals.





* Re: Next Steps For the Software Heritage Problem
  2024-06-20  6:51     ` MSavoritias
@ 2024-06-20 14:40       ` Simon Tournier
  2024-06-21  9:08         ` MSavoritias
  0 siblings, 1 reply; 42+ messages in thread
From: Simon Tournier @ 2024-06-20 14:40 UTC (permalink / raw)
  To: MSavoritias; +Cc: Ian Eure, guix-devel

Hi MSavoritias, all,

On Thu, 20 Jun 2024 at 09:51, MSavoritias <email@msavoritias.me> wrote:

>> Not to avoid the question but from a pragmatic point of view, one
>> might ask if the source code you write and do not want to be included
>> in the training dataset, if this source code is concretely part of
>> that training dataset.

[...]

> Thats all fair and valid. Sadly tho SWH:
> - Doesn't even mention on their website anything about what happens to
>   my code and where. so there is provenance. (unless i start searching
>   HuggingFace.

Being concrete and explicit, could you please share:

 1. Which part of your code is included in the pretraining dataset?

    It’s easy, you can copy/paste a snippet and it returns the location
    it comes from.

    https://huggingface.co/spaces/bigcode/search-v2a


 2. What is your code that is included in SWH archive?

    Again, it’s easy: checkout some commit of your repository, then
    inside this repository, you can run:

    echo "https://archive.softwareheritage.org/swh:1:dir:$(guix hash -S git -f hex -H sha1 .)"

    Do not miss the ’.’ (dot) once entering the repository.  This
    command returns the SWHID.  In other words, using this identifier,
    you can find out whether the repository is stored by SWH.  (Be
    careful with temporary artifacts such as .go files.)

    Or you can also check for one specific content:

  $ echo "https://archive.softwareheritage.org/swh:1:cnt:$(guix hash -S git -f hex -H sha1 COPYING)"
  https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    And the URL displays the content of the file COPYING.  Here, the
    GPL 3 license, for instance.
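
    For readers without Guix at hand, the content identifier above can
    be reproduced with any SHA-1 implementation, because a swh:1:cnt
    SWHID is the Git blob hash of the file.  A minimal sketch in Python;
    the function names are illustrative and are not part of Guix or of
    the SWH tooling:

```python
import hashlib

ARCHIVE = "https://archive.softwareheritage.org/"


def swhid_for_content(data: bytes) -> str:
    """Return the swh:1:cnt identifier for raw file content."""
    # Git hashes a blob as "blob <size in bytes>\0" followed by the bytes.
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def archive_url_for_file(path: str) -> str:
    """Build the archive URL for one file, e.g. COPYING (hypothetical helper)."""
    with open(path, "rb") as handle:
        return ARCHIVE + swhid_for_content(handle.read())
```

    Calling archive_url_for_file("COPYING") inside a checkout should
    give the same URL as the guix hash command, provided the file is
    byte-identical.  Directory identifiers (swh:1:dir) rely on the Git
    tree hash and are not covered by this sketch.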


 3. Where is such source code from #1 and #2 packaged by Guix?
 

That said, if the source is hosted on GitHub or GitLab.com or SourceHut
or CodeBerg or some other popular forges or even mirrored without your
consent on one of these, please consider that your code has been
ingested by ChatGPT without any means to verify.  Obviously, that’s not
an argument to accept the situation with HuggingFace and I understand
that you do not want your publicly released copyleft source code
to be reused by any LLM.

However, as said several times, enforcing this wish for non-inclusion
goes beyond your own will once you have publicly released such source
code under some copyleft license.  I hope we agree on that.

Again, I am not trying to avoid something.  And again, we all have heard
your points.  Nothing is ignored.  To my knowledge, the path forward is
not yet well-defined.

Since we are discussing at length with various different inputs, it
means that a common understanding and/or opinion does not seem obvious.


>> Well, I do not know if the outcome will be aligned with your current
>> opinion, but be sure that your concerns, like the others raised by Guix
>> community members, are taken into account.
>
> Thank you for giving me an honest and detailed answer.

I feel you are pushy on the topic and for what my opinion is worth, it
is not helpful to raise again and again that you want a way to opt-out.
Yeah, people got it. :-) And you are probably not alone, I guess.

It would help if you could provide source code that you wrote and
answer the three criteria above: included in pretraining dataset,
included in SWH, packaged by Guix.

I do not have special information from SWH but I am sure SWH people are
working on the topic.  And again, maybe the outcome will not be aligned
with your opinion.  Another story.

Now, the other question you ask to Guix: do we continue to help SWH in
harvesting?  You propose to stop, IIUC.  Ok, we got it, too. :-) From my
point of view, the path forward is not to speak in the abstract but to
ground the discussion in concrete numbers; it would help in bounding what
we are speaking about.

Concretely, if you would like to be able to opt-out, could you point:

1. the piece of the Guix source code of which you are the author?

2. source code of which you are the author that is packaged by Guix?

Again, I am not trying to avoid the discussion.  Instead, I would prefer
to root the discussion on concrete examples.  Then it would appear to me
easier to make progress.

As Greg or Ekaitz also wrote: opting out has implications on the meaning
of freedom behind “free software“.

IMHO, wanting to opt out does not mean that we could, would be able to,
or would be allowed to.  Therefore, instead of holding opinions in the
abstract, let us try to make progress and start from the concrete: which
piece of source code are we speaking about?

Cheers,
simon

   



* Re: Next Steps For the Software Heritage Problem
  2024-06-19  8:36   ` Dale Mellor
@ 2024-06-20 17:00     ` Andreas Enge
  2024-06-20 18:42       ` Dale Mellor
  0 siblings, 1 reply; 42+ messages in thread
From: Andreas Enge @ 2024-06-20 17:00 UTC (permalink / raw)
  To: Dale Mellor; +Cc: guix-devel

Am Wed, Jun 19, 2024 at 09:36:29AM +0100 schrieb Dale Mellor:
>   No, it's not.  I use Guix as a tool to develop my own projects, private and
> personal for reasons I'm keeping to myself.  As part of that I write package
> definitions for them, and use the Guix machinery to build and test.  I *cannot*
> have Guix just giving my code away to anybody, that is just fundamentally wrong.
> 
>   I think at least there should be a /restricted/ license type available to
> package definitions, and the system absolutely should not give source code away
> from packages which use this (of course, they won't get into the official
> distribution, but that's fine).

Is there a misunderstanding here? The Guix software framework does not
communicate software that you work on to outsiders. As I understand it,
SWH looks at the Guix packages that are publicly available in the Guix
git repo, and then archives the corresponding source code of these packages.
By definition, this is free software (otherwise we would not package it),
and available from elsewhere on the Internet (the "uri" part of the
"source" field). So I think Guix does not actually do anything in this
context, and all this discussion is moot. (Well, I suppose we may encourage
SWH to archive these sources, and am personally very much in favour of it;
but they do not need us for archiving the sources.)

The goal of SWH is to archive all free software in the world, and if you
want to prevent your software from appearing in their collection, the only
reliable solution is to not publish it as free software (which apparently
is your approach, Dale, for the software you are talking about).

Andreas




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 17:00     ` Andreas Enge
@ 2024-06-20 18:42       ` Dale Mellor
  2024-06-20 20:54         ` Andreas Enge
  2024-06-20 21:27         ` Simon Tournier
  0 siblings, 2 replies; 42+ messages in thread
From: Dale Mellor @ 2024-06-20 18:42 UTC (permalink / raw)
  To: Andreas Enge; +Cc: guix-devel

On Thu, 2024-06-20 at 19:00 +0200, Andreas Enge wrote:
> Am Wed, Jun 19, 2024 at 09:36:29AM +0100 schrieb Dale Mellor:
> >   No, it's not.  I use Guix as a tool to develop my own projects, private
> > and
> > personal for reasons I'm keeping to myself.  As part of that I write package
> > definitions for them, and use the Guix machinery to build and test.  I
> > *cannot*
> > have Guix just giving my code away to anybody, that is just fundamentally
> > wrong.
> 
> Is there a misunderstanding here? The Guix software framework does not
> communicate software that you work on to outsiders.

I'm sure guix lint tried to push my code out to them the last time I tried.




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 18:42       ` Dale Mellor
@ 2024-06-20 20:54         ` Andreas Enge
  2024-06-20 20:59           ` Ekaitz Zarraga
  2024-06-20 21:27         ` Simon Tournier
  1 sibling, 1 reply; 42+ messages in thread
From: Andreas Enge @ 2024-06-20 20:54 UTC (permalink / raw)
  To: Dale Mellor; +Cc: guix-devel

Am Thu, Jun 20, 2024 at 07:42:44PM +0100 schrieb Dale Mellor:
> I'm sure guix lint tried to push my code out to them the last time I tried.

Ah indeed, there is this in guix/lint.scm:

(define (check-archival package)
  "Check whether PACKAGE's source code is archived on Software Heritage.  If
it's not, and if its source code is a VCS snapshot, then send a \"save\"
request to Software Heritage.

It potentially calls this:
(define (save-package-source package)
  "Attempt to save the source of PACKAGE on SWH.  Return a list of warnings."

Which calls this from swh.scm:
(define* (save-origin url #:optional (type "git"))
  "Request URL to be saved."
  (call (swh-url "/api/1/origin/save" type "url" url) json->save-reply
        http-post*))

So it does not push code, but a URL from which the code can be downloaded.
Thus it requires the code to be available from the Internet; local code
is "safe" from SWH.

Now I do not know what will happen if you save your code as a git
repository at a hidden URL. For instance, does SWH check the license?
I would hope so.

There is documentation of this feature here:
   https://archive.softwareheritage.org/api/1/origin/save/doc/
which says this:
Depending of the provided origin url, the save request can either be:
- immediately accepted, for well known code hosting providers like for instance GitHub or GitLab
- rejected, in case the url is blacklisted by Software Heritage
- put in pending state until a manual check is done in order to determine if it can be loaded or not

So I suppose that if you submit a hidden, but publicly available URL
pointing to non-free code, the request will be "put in pending state",
manually checked and rejected, and maybe the URL added to the blacklist.
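
To make the point concrete that only a URL leaves the machine, here is a
minimal sketch in Python of what such a save request amounts to.  It only
builds the endpoint string and interprets a status word; it performs no
network call.  The trailing slash and the exact status strings are
assumptions taken from the documentation wording quoted above, and the
function names are hypothetical:

```python
SWH_BASE = "https://archive.softwareheritage.org"


def save_origin_endpoint(origin_url: str, visit_type: str = "git") -> str:
    """Build the endpoint that a save request would be POSTed to."""
    # Same path components as (swh-url "/api/1/origin/save" type "url" url):
    # the request carries the origin URL, never the source code itself.
    return f"{SWH_BASE}/api/1/origin/save/{visit_type}/url/{origin_url}/"


def describe_save_status(status: str) -> str:
    """Map the three documented outcomes to a short explanation."""
    outcomes = {
        "accepted": "SWH will fetch the code from the given URL",
        "rejected": "the URL is blacklisted and nothing is fetched",
        "pending": "a manual check decides whether the URL is loaded",
    }
    # Any other value is reported as-is (assumption: none is expected).
    return outcomes.get(status, "unknown status: " + status)
```

For example, save_origin_endpoint("https://example.org/repo.git") shows
that a repository which is not reachable at a public URL cannot be
archived through this route.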

Andreas




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 20:54         ` Andreas Enge
@ 2024-06-20 20:59           ` Ekaitz Zarraga
  2024-06-20 21:12             ` Andreas Enge
  2024-06-21  8:41             ` Dale Mellor
  0 siblings, 2 replies; 42+ messages in thread
From: Ekaitz Zarraga @ 2024-06-20 20:59 UTC (permalink / raw)
  To: Andreas Enge, Dale Mellor; +Cc: guix-devel

Hi,

On 2024-06-20 22:54, Andreas Enge wrote:
> Am Thu, Jun 20, 2024 at 07:42:44PM +0100 schrieb Dale Mellor:
>> I'm sure guix lint tried to push my code out to them the last time I tried.
> 
> Ah indeed, there is this in guix/lint.scm:
> 
> (define (check-archival package)
>    "Check whether PACKAGE's source code is archived on Software Heritage.  If
> it's not, and if its source code is a VCS snapshot, then send a \"save\"
> request to Software Heritage.
> 
> It potentially calls this:
> (define (save-package-source package)
>    "Attempt to save the source of PACKAGE on SWH.  Return a list of warnings."
> 
> Which calls this from swh.scm:
> (define* (save-origin url #:optional (type "git"))
>    "Request URL to be saved."
>    (call (swh-url "/api/1/origin/save" type "url" url) json->save-reply
>          http-post*))
> 
> So it does not push code, but a URL from which the code can be downloaded.
> Thus it requires the code to be available from the Internet; local code
> is "safe" from SWH.
> 
> Now I do not know what will happen if you save your code as a git
> repository at a hidden URL. For instance, does SWH check the license?
> I would hope so.
> 
> There is documentation of this feature here:
>     https://archive.softwareheritage.org/api/1/origin/save/doc/
> which says this:
> Depending of the provided origin url, the save request can either be:
> - immediately accepted, for well known code hosting providers like for instance GitHub or GitLab
> - rejected, in case the url is blacklisted by Software Heritage
> - put in pending state until a manual check is done in order to determine if it can be loaded or not
> 
> So I suppose that if you submit a hidden, but publicly available URL
> pointing to non-free code, the request will be "put in pending state",
> manually checked and rejected, and maybe the URL added to the blacklist.
> 
> Andreas
> 
> 

For this specific case we could add some flag to the command line like 
`--do-not-archive` or something like that.

WDYT?



* Re: Next Steps For the Software Heritage Problem
  2024-06-20 20:59           ` Ekaitz Zarraga
@ 2024-06-20 21:12             ` Andreas Enge
  2024-06-21  8:41             ` Dale Mellor
  1 sibling, 0 replies; 42+ messages in thread
From: Andreas Enge @ 2024-06-20 21:12 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: Dale Mellor, guix-devel

Am Thu, Jun 20, 2024 at 10:59:41PM +0200 schrieb Ekaitz Zarraga:
> For this specific case we could add some flag to the command line like
> `--do-not-archive` or something like that.

guix lint -x archival

if I understand "guix lint --help" correctly.

Andreas




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 18:42       ` Dale Mellor
  2024-06-20 20:54         ` Andreas Enge
@ 2024-06-20 21:27         ` Simon Tournier
  1 sibling, 0 replies; 42+ messages in thread
From: Simon Tournier @ 2024-06-20 21:27 UTC (permalink / raw)
  To: Dale Mellor, Andreas Enge; +Cc: guix-devel

Hi,

On Thu, 20 Jun 2024 at 19:42, Dale Mellor <guix-devel-0brg6a@rdmp.org> wrote:

> I'm sure guix lint tried to push my code out to them the last time I
> tried.

Yes, it’s the checker ’archival’.

Therefore, running “guix lint -x archival” does not send any request to
SWH.

Cheers,
simon




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 20:59           ` Ekaitz Zarraga
  2024-06-20 21:12             ` Andreas Enge
@ 2024-06-21  8:41             ` Dale Mellor
  2024-06-21  9:19               ` MSavoritias
  1 sibling, 1 reply; 42+ messages in thread
From: Dale Mellor @ 2024-06-21  8:41 UTC (permalink / raw)
  To: Ekaitz Zarraga, Andreas Enge; +Cc: guix-devel

On Thu, 2024-06-20 at 22:59 +0200, Ekaitz Zarraga wrote:
> Hi,
> 
> On 2024-06-20 22:54, Andreas Enge wrote:
> > Am Thu, Jun 20, 2024 at 07:42:44PM +0100 schrieb Dale Mellor:
> > > I'm sure guix lint tried to push my code out to them the last time I
> > > tried.
> > 
> > Ah indeed, there is this in guix/lint.scm:
> > 
> > So it does not push code, but a URL from which the code can be downloaded.
> > Thus it requires the code to be available from the Internet; local code
> > is "safe" from SWH.

   But this is still leaking information.

> > Now I do not know what will happen if you save your code as a git
> > repository at a hidden URL. For instance, does SWH check the license?
> > I would hope so.

   Hope is not really good enough, there needs to be certainty in this.

> 
> For this specific case we could add some flag to the command line like 
> `--do-not-archive` or something like that.

   `-x archival` does it, but it is too easy to forget and once the cat is out
of the bag privacy is lost.  I really think this should be default behaviour, or
at least there should be a flag in the package definition.  I would still be
uncomfortable with the last option, as everyone would be relying on the
collective of Guix maintainers to not screw up and accidentally leak private
data.
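
Until something like that exists, one local way to make the exclusion
harder to forget is a small wrapper that always passes `-x archival`.
A minimal sketch in Python, assuming `guix` is on PATH; the wrapper and
its function names are hypothetical and not part of Guix:

```python
import subprocess


def lint_command(packages):
    """Return the argv for guix lint with the archival checker excluded."""
    # "-x archival" is the exclusion mentioned earlier in the thread; with
    # it, the checker that sends "save" requests to SWH is not run.
    return ["guix", "lint", "-x", "archival", *packages]


def run_private_lint(packages):
    """Run the command built above and return guix's exit status."""
    return subprocess.run(lint_command(packages)).returncode
```

This only protects the person who remembers to use the wrapper, so it
does not address the concern about the default behaviour.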

Dale




* Re: Next Steps For the Software Heritage Problem
  2024-06-20 14:35           ` Ekaitz Zarraga
@ 2024-06-21  8:51             ` MSavoritias
  0 siblings, 0 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-21  8:51 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: raingloom, Simon Tournier, Ian Eure, guix-devel

On Thu, 20 Jun 2024 16:35:10 +0200
Ekaitz Zarraga <ekaitz@elenq.tech> wrote:

> > 2. You seem to imply that Free Software or code is apolitical. (in the
> > sense of social or state politics not) Which it is not. Nothing is.
> > For example Free Software is explicitly pro-capitalist and
> > pro-Google/big companies. I am not saying I disagree, but its good
> > to keep in mind that politics exist and do exist always. And in the case  
> 
> I'm not one of those people that think everything is politics but that's 
> not a debate I want to open. Free Software can be understood from many 
> ways. I don't think it's pro-capitalist, but pro-freedom, but that 
> freedom affects the capitalists too, and it's a *value* they have. But 
> freedom is also an anarchist value, and it can be an anti-capitalist 
> value too it becomes more politic when you put more things around it. 
> The issue I was trying to point is Free Software attracts many people 
> from many different backgrounds and politics, and trying to push for one 
> side defeats its purpose: making people stay together because they have 
> some shared value.

I agree up to a point. There are a lot of ifs and buts here and the CoC covers some of them already.
Not every political opinion should be respected.

> >> There are many valid reasons why someone might criticize the Free
> >> Software movement and people behind it, but making free software only
> >> has 4 simple rules. If you don't comply with them you are not free
> >> software anymore. It's as simple as that, and that simple it should
> >> be.
> >>
> >> Free Software gives me the FREEDOM to print the code, make a roll
> >> with it and shove it up my ass if I want to (and even distribute my
> >> modified copies for other people to do so). The same freedom I have
> >> to upload it to github. If you prevent me from doing one or the other
> >> you are restricting my freedom and that's defeating the purpose of
> >> free software and we cannot consider your code free software anymore.
> >> The line is clear, and trying to pretend to be free software while
> >> restricting people's freedoms (regardless of what they are) is absurd.  
> > 
> > This is missing the context that GPL does indeed restrict people's
> > freedom to license code as the see fit. Because it was written to
> > further the political goals of FSF. It is on purpose. So we are already
> > restricting the freedom of people to do what they want on purpose.  
> 
> It does restrict your freedom but only if your goal is restrict other 
> people's software freedom. I'd say the argument here was that GPL 
> provides more absolute freedom in the current world than other licenses 
> but I don't think the GPL was a very easy decision to make for the 
> radical freedom fighters. That's why some people don't like it.

Sure, I agree. My point was more that we already restrict stuff to make room for better things.
Same way the CoC restricts some people from participating so that our spaces can be safer for people to participate.
It's the tradeoffs you have to make. By allowing everybody to do whatever they want or allowing everybody to say whatever they want, you end up losing everybody.
As you said yourself.

> > And lets not forget
> > "your freedom ends where the other persons freedom begins"
> > and consent of course in the issue at hand.  
> 
> Yes, but I don't think this is a matter Free Software needs to deal 
> with. And my original message was around that.
> 
> Now, we should do something as a set of people that collaboratively work 
> in a project. Probably not under the Free Software label, because what 
> free software is is already pretty clear and well defined, but as 
> something else, may that be Guix users and contributors, if we wish.

Yep, I agree. And this is exactly what I wanted to do in my proposal in the first place :D

> >> The Free Software movement can be labeled (and is often labeled) as a
> >> political movement but I'd say it's more of an ethical movement. It's
> >> a way to share *values* and the value we share here is freedom. We
> >> might or might not share other values, politics, religion or
> >> anything, but as long as we put the freedom in the first place we
> >> should agree that free software is better than any other software
> >> model we have.
> >>
> >> There are bad actors in the world (say thieves, killers or... GitHub
> >> and AI), and we can discuss about how we should deal with them but I
> >> don't think the answer is putting our *values* aside but embrace them
> >> harder (one value, freedom, in our case).  
> > 
> > Definetily agree. The solution is not to embrace propietary software or
> > restrict software. Its to write down some common social rules that are
> > rooted in consent.
> >   
> >> If people is not happy with the Free Software movement because it
> >> puts the freedom first, I can only understand it as people being mad
> >> about Free Software because it's about software.
> >>
> >> For other values, we can start other initiatives I may or may not
> >> agree more with, but if the value is freedom (in software), I don't
> >> think there's any better way to push for it. But trying to disguise
> >> other things inside of the Free Software is kind of dishonest.  
> > 
> > Fair. I mean we already have CoC and channel descriptions. Idk if we
> > have event guidelines/CoC yet but we should.
> >   
> >> I don't know, maybe I'm just a little bit tired.  
> > 
> > No worries. I think it was very well said.
> > 
> > MSavoritias  
> 
> That was just for clarifying my point wasn't against this discussion but 
> to say that the decision Efraim took on dbxfs is not only correct but 
> the only possible decision, and that it should be.

I think our decisions should be based a lot more on context than on dogma or some kind of immovable law. But that is just me, and probably a discussion for another time.

> Now in Guix, I don't feel comfortable with the fact we are helping 
> people use AI that doesn't respect the licenses of our work to be 
> trained. I'm sick of it.
> 
> If they respected the licenses, I'd be ok with it. Since I accepted Free 
> Software's social contract I'm open for anyone to use my code with any 
> purpose (unless they don't respect people's freedom later).
> 
> Also, even if we don't do anything about it, Guix's codebase is public, 
> so they could do it anyway, regardless of SWH, so there's not much we 
> can do about that.

I mean, sure. But the problem is that Guix actively gives them the source code, which they use for the wrong purposes.
I wouldn't have a problem if it were only for archiving. Just because somebody else is an asshole doesn't mean we have to be.

Also, a lot of people don't see the GPL as a Free Software social contract. They see it as a legal license.
We could probably define some kind of Free Software contract on top of it, but I am guessing that:
1. It would go against the GPL, because the GPL does not let just anybody use your code for any purpose. Otherwise we would use the public domain.
2. A lot of people probably couldn't accept it. See, for example, the hostile forks that have happened even inside GNU.

> What we *can* do is raise our concerns to SWH, motivating them to be 
> more strict with their collaboration with companies or with the terms of 
> their collaboration. It's probably better that they are in our side in 
> this battle than if we are alone. I think they are sensible to this 
> issue so it shouldn't be hard to have a proper conversation with them 
> and see if we can understand better what they do, how, in which terms 
> and so on.

I agree. I don't want to burn any bridges. Which is why I made the proposal that I did. To put social pressure on them to actually respect consent.
 
> Maybe it's better that these AI companies reach our code through SWH 
> with a well-written contract than letting them steal it from the 
> internet without having them to sign anything.
> 
> I'm kind of just guessing there, but we are probably stronger that way.
> Also, if we could make other distros to take part on this it would be a 
> great way to be stronger.
> 
> In any case, I think SWH are more than sensible to this issue and I 
> think their connections might be helpful to not only restrict this 
> HugginFace from doing shady things but to start pushing for regulation 
> for every AI company that uses our sweat for their purposes.
> 
> So, to come back to my original point: It's not the free software that 
> needs to change. It's the regulation of AI companies that should, and 
> the responsibility we demand from them. Legally and morally, they should 
> be accountable of what they do, and that's the direction I'd like to 
> approach this. Maybe it's not easy to change the regulation of the whole 
> world, but we can try to push for it in Europe (we pioneered some 
> related regulations before) first.

Maybe. Then again, this changes nothing in the current discussion:
a code harvesting system like the one SWH has needs to be opt-in, with consent.
Laws or not.

Then everybody can take the decision they think is based and give or not give their code to the LLM model :)

> In summary, I don't think this is just a SWH is bad/good or Free 
> Software is bad/good issue.
> 
> Best,
> Ekaitz
> 
> PS: If there's action I'm open and ready for it, but I won't like this 
> discussion to become an exercise of ethical bragging with no goals.

Please see my initial email in this thread for actionable goals :)
I plan to send a PR/MR/email for them soonish.

MSavoritias


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-20 14:40       ` Simon Tournier
@ 2024-06-21  9:08         ` MSavoritias
  0 siblings, 0 replies; 42+ messages in thread
From: MSavoritias @ 2024-06-21  9:08 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Ian Eure, guix-devel

On Thu, 20 Jun 2024 16:40:57 +0200
Simon Tournier <zimon.toutoune@gmail.com> wrote:

> Being concrete and explicit, could you please share:
> 
>  1. Which part of your code is included in the pretraining dataset?
> 
>     It’s easy, you can copy/paste a snippet and it returns the location
>     from where it comes from.
> 
>     https://huggingface.co/spaces/bigcode/search-v2a
> 
> 
>  2. What is your code that is included in SWH archive?
> 
>     Again, it’s easy: checkout some commit of your repository, then
>     inside this repository, you can run:
> 
>     echo "https://archive.softwareheritage.org/swh:1:dir:$(guix hash -S git -f hex -H sha1 .)"
> 
>     Do not miss the ’.’ (dot) once entering the repository.  This
>     command returns SWHID.  Other said, using this identifier, you might
>     know if the repository is stored by SWH.  (Be careful with temporary
>     artifacts as .go files or else.)
> 
>     Or you can also check for one specific content:
> 
>   $ echo "https://archive.softwareheritage.org/swh:1:cnt:$(guix hash -S git -f hex -H sha1 COPYING)"
>   https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
> 
>     And the URL display the content of the file COPYING.  Here GPL 3
>     license for instance.
> 
> 
>  3. Where such source code from #2 and #3 is packaged by Guix?
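
For readers who want to see what the `swh:1:cnt:` identifier in the quoted commands is built from, here is a minimal sketch. It assumes that, for a single file, `guix hash -S git -f hex -H sha1` yields the same value as the Git blob object ID, which the COPYING output quoted above suggests. The function names `git_blob_sha1`, `content_swhid` and `archive_url` are hypothetical helpers for illustration and are not part of Guix or of any SWH tool.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the Git blob object ID (hex SHA-1) for raw file content.

    Git hashes the header "blob <size>\\0" followed by the content itself.
    """
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build a content identifier of the form swh:1:cnt:<hex sha1>."""
    return "swh:1:cnt:" + git_blob_sha1(data)


def archive_url(swhid: str) -> str:
    """Prefix an identifier with the archive host used in the commands above."""
    return "https://archive.softwareheritage.org/" + swhid


if __name__ == "__main__":
    # Hash a file from the current directory, if it exists (assumed name).
    try:
        with open("COPYING", "rb") as handle:
            print(archive_url(content_swhid(handle.read())))
    except FileNotFoundError:
        print("no COPYING file here; pass any bytes to content_swhid() instead")
```

Opening the resulting URL only tells you whether that exact file content is in the archive. It says nothing about which repository it was collected from, nor whether it ended up in any training set.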

My code is not yet in Guix. The question and the actions I mentioned came about because I want to commit my package to Guix,
but the minute I do, it is shared with SWH without my consent.

> That said, if the source is hosted on GitHub or GitLab.com or SourceHut
> or CodeBerg or some other popular forges or even mirrored without your
> consent on one of these, please consider that your code had been
> ingested by ChatGPT without any mean to verify.  Obviously, that’s not
> an argument to accept the situation with HuggingFace and I understand
> that you do not want that your publicly release copyleft source code
> could be reused by any LLM.
> 
> However, as said several times, rooting this willing of non-inclusion is
> larger than your own willing once you publicly released such source code
> under some copyleft license.  I hope we agree on that.
> 
> Again, I am not trying to avoid something.  And again, we all have heard
> your points.  Nothing is ignored.  To my knowledge, the path forward is
> not yet well-defined.
> 
> Since we are discussing at length with various different inputs, it
> means that a common understanding and/or opinion does not seem obvious.

Let me put it more clearly. I am NOT asking SWH to stop training the LLM, and I am NOT asking Guix to take a stance against LLMs.
I do know that my code is going to be harvested anyway.
What I DO ask is:
1. For SWH to make the sharing of code with the LLM strictly opt-in.
2. For Guix not to enable that behavior until that is fixed, because it is against our social rules and CoC.
For the second point, I already outlined in the first emails some steps we could take to protect our package authors and show our disagreement.
Also, in the XMPP chat it was pointed out that Guix can simply stop sending new package code until an opt-in system is in place.

> >> Well, I do not know if the outcome will be aligned with your current
> >> opinion, but be sure that your concerns as the others raised by Guix
> >> community members are taking into account.  
> >
> > Thank you for giving me an honest and detailed answer.  
> 
> I feel you are pushy on the topic and for what my opinion is worth, it
> is not helpful to raise again and again that you want a way to opt-out.
> Yeah, people got it. :-) And you are probably not alone, I guess.

Ah, I am not pushing for what I want, though; that is not how the thread started :)
The thread started with me saying what I am going to DO concretely about the SWH problem, that is all.
I already have some practical steps, if you read it, and as I said I am going to start sending PRs/MRs/emails soonish to move it forward.
I just wanted to give the list a heads-up so it doesn't come out of nowhere.

> I do not have special information from SWH but I am sure SWH people are
> working on the topic.  And again, maybe the outcome will not be aligned
> with your opinion.  Another story.
> 
> Now, the other question you ask to Guix: do we continue to help SWH in
> harvesting?  You propose to stop, IIUC.  Ok, we got it, too. :-) From my
> point of view, the path forward is not to speak on the abstract but to
> root on concrete numbers; it would help in bounding what we are speaking
> about.
> 
> Concretely, if you would like to be able to opt-out, could you point:
> 
> 1. the piece from the Guix source code you are the author?
> 
> 2. source code you are the author that is packaged by Guix?
> 
> Again, I am not trying to avoid the discussion.  Instead, I would prefer
> to root the discussion on concrete examples.  Then it would appear to me
> easier to make progress.
> 
> As Greg or Ekaitz also wrote: opting out has implications on the meaning
> of freedom behind “free software“.

I mean, it does if you think that:
1. Guix doesn't have any social rules on top of the FSF definition (it does) and that it doesn't respect consent.
2. It's not about the context of something. For example, the GPL and our CoC restrict freedom so that people can be more free to express themselves :)

> IMHO, that’s not because we would like to opt-out that we could, would
> be able to or allowed to.  Therefore, instead of holding opinions on the
> abstract, let try to make progress and start on the concrete: which
> piece of source code are we speaking about?

The software here -> https://sr.ht/~msavoritias/
The minute I add it to Guix, the code is going to be in SWH.
This is not only about my software, but it is the example you wanted.

MSavoritias

> Cheers,
> simon
> 
>    



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-21  8:41             ` Dale Mellor
@ 2024-06-21  9:19               ` MSavoritias
  2024-06-21 13:33                 ` Luis Felipe
  0 siblings, 1 reply; 42+ messages in thread
From: MSavoritias @ 2024-06-21  9:19 UTC (permalink / raw)
  To: Dale Mellor; +Cc: Ekaitz Zarraga, Andreas Enge, guix-devel

On Fri, 21 Jun 2024 09:41:10 +0100
Dale Mellor <guix-devel-0brg6a@rdmp.org> wrote:

> On Thu, 2024-06-20 at 22:59 +0200, Ekaitz Zarraga wrote:
> > Hi,
> > 
> > On 2024-06-20 22:54, Andreas Enge wrote:  
> > > Am Thu, Jun 20, 2024 at 07:42:44PM +0100 schrieb Dale Mellor:  
> > > > I'm sure guix lint tried to push my code out to them the last time I
> > > > tried.  
> > > 
> > > Ah indeed, there is this in guix/lint.scm:
> > > 
> > > So it does not push code, but a URL from which the code can be downloaded.
> > > Thus it requires the code to be available from the Internet; local code
> > > is "safe" from SWH.  
> 
>    But this is still leaking information.
> 
> > > Now I do not know what will happen if you save your code as a git
> > > repository at a hidden URL. For instance, does SWH check the license?
> > > I would hope so.  
> 
>    Hope is not really good enough, there needs to be certainty in this.
> 
> > 
> > For this specific case we could add some flag to the command line like 
> > `--do-not-archive` or something like that.  
> 
>    `-x archival` does it, but it is too easy to forget and once the cat is out
> of the bag privacy is lost.  I really think this should be default behaviour, or
> at least there should be a flag in the package definition.  I would still be
> uncomfortable with the last option, as everyone would be relying on the
> collective of Guix maintainers to not screw up and accidentally leak private
> data.
> 
> Dale

Yeah, I very much agree this should be the default behavior. Archiving should be opt-in to avoid any surprises for the person running it.
I am surprised it became the default, actually.

MSavoritias


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-21  9:19               ` MSavoritias
@ 2024-06-21 13:33                 ` Luis Felipe
  0 siblings, 0 replies; 42+ messages in thread
From: Luis Felipe @ 2024-06-21 13:33 UTC (permalink / raw)
  To: MSavoritias, Dale Mellor; +Cc: guix-devel


Hi,

El 21/06/24 a las 9:19, MSavoritias escribió:
> On Fri, 21 Jun 2024 09:41:10 +0100
> Dale Mellor <guix-devel-0brg6a@rdmp.org> wrote:
>   
>>     `-x archival` does it, but it is too easy to forget and once the cat is out
>> of the bag privacy is lost.  I really think this should be default behaviour, or
>> at least there should be a flag in the package definition.  I would still be
>> uncomfortable with the last option, as everyone would be relying on the
>> collective of Guix maintainers to not screw up and accidentally leak private
>> data.
>>
>> Dale
> Yeah very much agree this should be the default behavior. Archiving should be opt-in to avoid any surprises for the person running it.
> I am surprised it became default actually.

MSavoritias, Dale, I think this is one specific point you could report 
as an issue (https://issues.guix.gnu.org/), track it with a number and 
maybe provide patches if you are able to.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-18 18:08 ` Ian Eure
  2024-06-19 10:31   ` raingloom
@ 2024-06-27 12:27   ` Ludovic Courtès
  2024-06-27 15:30     ` Ian Eure
  1 sibling, 1 reply; 42+ messages in thread
From: Ludovic Courtès @ 2024-06-27 12:27 UTC (permalink / raw)
  To: Ian Eure; +Cc: guix-devel

Ian Eure <ian@retrospec.tv> skribis:

> Guix sends archive requests to SWH.  SWH gives that source code to
> HuggingFace.  HuggingFace demonstrably violates the licenses.

Which licenses?  As has been said previously, and you can verify for
yourself, it does not ingest code under copyleft licenses.

Ludo’.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-27 12:27   ` Ludovic Courtès
@ 2024-06-27 15:30     ` Ian Eure
  2024-06-27 16:48       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  2024-06-27 16:58       ` Ludovic Courtès
  0 siblings, 2 replies; 42+ messages in thread
From: Ian Eure @ 2024-06-27 15:30 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Ian Eure <ian@retrospec.tv> skribis:
>
>> Guix sends archive requests to SWH.  SWH gives that source code 
>> to
>> HuggingFace.  HuggingFace demonstrably violates the licenses.
>
> Which licenses?  As has been said previously, and you can verify 
> for
> yourself, it does not ingest code under copyleft licenses.
>

While this is what their paper claims[1], it doesn’t appear to be 
true, since I can see my own GPL’d code in the training set.  I’ve 
since moved nearly all of my code off GitHub, but if you visit 
their "Am I in The Stack?" page[2] and enter my old username 
("ieure"), you will see pretty much every repository I ever hosted 
there, including both unlicensed and GPL’d code.  Some examples 
are hyperspace-el, nssh-el, tl1-mode, etc.  While there aren’t 
LICENSE files in those repos, the file headers of all clearly 
indicate that they’re GPL’d.

Unfortunately, there is no way to check for the presence of code 
in the training set except by GitHub username.

What I don’t know for certain is whether these are in the training 
set because they came from SWH, or because HuggingFace obtained 
them through other means.  Given that all the links for my GitHub 
username on that "Am I in The Stack" link back to SWH, it seems 
very likely that it came from them.

Thanks,

  — Ian

[1]: https://arxiv.org/pdf/2402.19173 "We also exclude 
copyleft-licensed code..."
[2]: https://huggingface.co/spaces/bigcode/in-the-stack


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-27 15:30     ` Ian Eure
@ 2024-06-27 16:48       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  2024-06-27 16:58       ` Ludovic Courtès
  1 sibling, 0 replies; 42+ messages in thread
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-06-27 16:48 UTC (permalink / raw)
  To: Ian Eure, Ludovic Courtès; +Cc: guix-devel

Hi Ian,

On Thu, Jun 27 2024, Ian Eure wrote:

> I’ve [...] moved nearly all of my code off GitHub

Me too.  I think I closed it off from search crawlers.  No one should be
using Github anymore except for merge requests.  I left many years ago.

> if you visit their "Am I in The Stack?" page

Thank you for the link!

> pretty much every repository I ever hosted [is in] there, including
> both unlicensed and GPL’d code.

Mine too.  My software likewise has valid headers but no LICENSE files.

> Unfortunately, there is no way to check for the presence of code 
> in the training set except by GitHub username.

That's probably because you and I may eventually become part of a class
of copyright holders in a court action.

> What I don’t know for certain is whether these are in the training 
> set because they came from SWH, or because HuggingFace obtained 
> them through other means.

I can say for certain that none of my items (username "lechner") are in
Guix or elsewhere, so they probably did not originate via SWH.

Also, did you see the opt-out link at the bottom?  I considered it but
would on balance prefer to be part of the settlement class.

Kind regards
Felix


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
  2024-06-27 15:30     ` Ian Eure
  2024-06-27 16:48       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
@ 2024-06-27 16:58       ` Ludovic Courtès
  1 sibling, 0 replies; 42+ messages in thread
From: Ludovic Courtès @ 2024-06-27 16:58 UTC (permalink / raw)
  To: Ian Eure, MSavoritias; +Cc: guix-devel

Hi,

Ian Eure <ian@retrospec.tv> skribis:

> While this is what their paper claims[1], it doesn’t appear to be
> true, since I can see my own GPL’d code in the training set.  I’ve
> since moved nearly all of my code off GitHub, but if you visit their
> "Am I in The Stack?" page[2] and enter my old username ("ieure"), you
> will see pretty much every repository I ever hosted there, including
> both unlicensed and GPL’d code.

That’s not my experience: I looked for Guix and Coreutils, both GPL’d,
both mirrored on GitHub, and none of it is there.

> Some examples are hyperspace-el,
> nssh-el, tl1-mode, etc.  While there aren’t LICENSE files in those
> repos, the file headers of all clearly indicate that they’re GPL’d.

Well, not providing a COPYING/LICENSE file isn’t helping either: file
headers may not be all that clear to a parser.


At any rate, even though I’m watching this LLM trend with discontent
like many in the free software world, I believe this discussion is
missing the point and shooting the messenger(s).

One of the three missions of SWH is to share code—much like ftp.gnu.org.
That’s all they did.  Anyone can access the archive of SWH, for any
purpose.

HuggingFace trained “BigCode” on source SWH harvested from GitHub (a
subset of the SWH archive) and chose to abide by the principles put
forward by SWH in its Oct. 2023 statement.  HuggingFace didn’t have to
do that; they could have acted like Microsoft and all the “AI” companies
and just scrape everything without asking anyone—be it from SWH or from
other sources.


There is no “Software Heritage problem” and really, that very phrase and
the accusative tone in this thread is unwelcome and below our standards
for communication in Guix.  This has gone too far.  This is not the
place to further discuss the impact of using LLMs on free software, and
definitely not the place to throw unfounded accusations.

Thanks,
Ludo’.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Next Steps For the Software Heritage Problem
@ 2024-06-28 18:01 Juliana Sims
  0 siblings, 0 replies; 42+ messages in thread
From: Juliana Sims @ 2024-06-28 18:01 UTC (permalink / raw)
  To: Ludovic Courtès, MSavoritias, ian; +Cc: guix-devel

Hey y'all,

I've avoided weighing in on this topic because I'm of two minds about 
it. Still, when members of the community raise concerns, it's important 
to take those concerns seriously. We must be careful how we address 
them because the opinions and concerns of any community member are as 
legitimate as those of any other.

This conversation has at times been contentious. People have not always 
used the most diplomatic language. And yet, there has been a thorough 
discussion of this topic. The conclusion appears to be that Guix cannot 
make changes in relation to SWH. It's clear there is no more room for 
productive conversation. I therefore echo Ludo's request to let this 
topic drop.

I want to express my gratitude for a community where people are able to 
express their concerns and have them taken seriously, regardless of who 
they are. Let's not lose that. Let's not forget that, even when 
passions are high, we all want Guix to succeed and have a healthy 
community, and we all work to that end as best as we can with the 
information and resources available to us.

Best,
Juli




^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2024-06-28 18:03 UTC | newest]

Thread overview: 42+ messages
2024-06-19  7:52 Next Steps For the Software Heritage Problem Simon Tournier
2024-06-19  9:13 ` MSavoritias
2024-06-19  9:54   ` Efraim Flashner
2024-06-19 10:25     ` raingloom
2024-06-19 15:46       ` Ekaitz Zarraga
2024-06-20  6:36         ` MSavoritias
2024-06-20 14:35           ` Ekaitz Zarraga
2024-06-21  8:51             ` MSavoritias
2024-06-19 10:34     ` MSavoritias
2024-06-19 14:41   ` Simon Tournier
2024-06-20  6:51     ` MSavoritias
2024-06-20 14:40       ` Simon Tournier
2024-06-21  9:08         ` MSavoritias
  -- strict thread matches above, loose matches on Subject: below --
2024-06-28 18:01 Juliana Sims
2024-06-18 17:12 Andy Tai
2024-06-18 18:08 ` Ian Eure
2024-06-19 10:31   ` raingloom
2024-06-27 12:27   ` Ludovic Courtès
2024-06-27 15:30     ` Ian Eure
2024-06-27 16:48       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
2024-06-27 16:58       ` Ludovic Courtès
2024-06-18  8:37 MSavoritias
2024-06-18 14:19 ` Ian Eure
2024-06-19  8:36   ` Dale Mellor
2024-06-20 17:00     ` Andreas Enge
2024-06-20 18:42       ` Dale Mellor
2024-06-20 20:54         ` Andreas Enge
2024-06-20 20:59           ` Ekaitz Zarraga
2024-06-20 21:12             ` Andreas Enge
2024-06-21  8:41             ` Dale Mellor
2024-06-21  9:19               ` MSavoritias
2024-06-21 13:33                 ` Luis Felipe
2024-06-20 21:27         ` Simon Tournier
2024-06-18 16:21 ` Greg Hogan
2024-06-18 16:33   ` MSavoritias
2024-06-18 17:31     ` Greg Hogan
2024-06-18 17:57       ` Ian Eure
2024-06-19  7:01       ` MSavoritias
2024-06-19  9:57         ` Efraim Flashner
2024-06-20  2:56         ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
2024-06-20  5:18           ` MSavoritias
2024-06-19 10:10 ` Efraim Flashner
