unofficial mirror of guix-devel@gnu.org 
From: Kaelyn <kaelyn.alexi@protonmail.com>
To: guix-devel <guix-devel@gnu.org>
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 16:27:24 +0000
Message-ID: <rtj9YLsLTYZEMdtufLqorh5X0mk_voKi1ByA55BdCEYgfucHCmwVPvGw9vC1kNPsI_ZALDdOOYC5U3tgk1kCiOlMO6gZrUHi3BUVacyeKgo=@protonmail.com>
In-Reply-To: <87a5mvyjl4.fsf@gmail.com>

On Monday, March 18th, 2024 at 2:28 AM, Simon Tournier <zimon.toutoune@gmail.com> wrote:

> 
> Hi,
> 
> On sam., 16 mars 2024 at 08:52, Ian Eure ian@retrospec.tv wrote:
> 
> > They appear to be using the archive to build LLMs:
> > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
> 
> 
> About LLM, Software Heritage made a clear statement:
> 
> https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
> 
> Quoting:
> 
> We feel that the question is no longer whether LLMs for code
> should be built. They are already being built, independently of
> what we do, and there is no turning back. The real question is
> how they should be built and whom they should benefit.
> 
> Principles:
> 
> 1. Knowledge derived from the Software Heritage archive must be
> given back to humanity, rather than monopolized for private
> gain. The resulting machine learning models must be made available
> under a suitable open license, together with the documentation and
> toolings needed to use them.
> 
> 2. The initial training data extracted from the Software Heritage
> archive must be fully and precisely identified by, for example,
> publishing the corresponding SWHID identifiers (note that, in the
> context of Software Heritage, public availability of the initial
> training data is a given: anyone can obtain it from the
> archive). This will enable use cases such as: studying biases
> (fairness), verifying if a code of interest was present in the
> training data (transparency), and providing appropriate attribution
> when generated code bears resemblance to training data (credit),
> among others.
> 
> 3. Mechanisms should be established, where possible, for authors to
> exclude their archived code from the training inputs before model
> training begins.
> 
> I hope it clarifies your concerns to some extent.
> 
> 
> Moreover, you wrote: « I want absolutely nothing to do with them. »
> 
> Maybe there is a misunderstanding on your side about what “free
> software” and GPL means because once “free software”, you cannot prevent
> people to use “your” free software for any purposes you dislike.
> 
> If you want to bound the use cases of the software you create, you need
> to explicitly specify that in the license. And if you do, your software
> will not be considered as “free software”.
> 
> That’s the double sword of “free software”. :-)

Hi,

I want to stress that I am not a lawyer, but my (possibly outdated) understanding of what machine learning models can and cannot do with regard to their training data, together with a reading of parts of the GPL 2 and 3, suggests that at best SWH's LLM is in a legal grey area and at worst it directly violates the license of the GPL code that it ingests for training. As such, I don't think it is accurate to say "you cannot prevent people to use “your” free software for any purposes you dislike" in response to concerns about the automatic inclusion of free software in LLM training sets. Specifically, my understanding (as of a few years ago) is that LLMs have difficulty tracing and attributing various aspects of their training to specific inputs, which seems to violate e.g. Sections 5 and 6 of the GPL 3. Specific quotes from those sections, taken from https://www.gnu.org/licenses/gpl-3.0.html:

From section 5:
> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
> 
>     a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
>     b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
>     c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
>     d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.

and from Section 6:
> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:
> 
>     a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange.
>     b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge.
>     c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b.
>     d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements.
>     e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d.

And from the GPL 2 text at https://www.gnu.org/licenses/old-licenses/gpl-2.0.html:

> 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
> 
>     a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. 
>     b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. 
>     c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) 
> 
> These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.
> 
> Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.
> 
> In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.
> 
> 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:
> 
>     a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, 
>     b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, 
>     c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) 
> 
> The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
> 
> If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.
> 
> 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 

Again, I want to emphasize that I am not a lawyer. As a layman, my understanding of ML model training is that it cannot keep enough of a trace between GPLed input code and its (modified) use in the output to satisfy the licensing and distribution requirements of either the GPL 3 sections above or the GPL 2 sections 2 and 3. I also believe that section 4 of the GPL 2 directly applies to these LLM code models.

There are also potential licensing issues with mixing (potentially) incompatible licenses in the training data sets, such as GPL and CDDL code, with no way to distinguish or separate the (arguably) modified sources from each.

Just my $0.02 USD on the LLM side of the matter, as much of the discussion seems to be around the cost vs. benefit of rewriting the git history to update personally identifying information.

Cheers,
Kaelyn

> 
> Cheers,
> simon

