From: Kaelyn
To: guix-devel
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 16:27:24 +0000
In-Reply-To: <87a5mvyjl4.fsf@gmail.com>
References: <87il1mupco.fsf@meson> <87a5mvyjl4.fsf@gmail.com>
List-Id: "Development of GNU Guix and the GNU System distribution."

On Monday, March 18th, 2024 at 2:28 AM, Simon Tournier wrote:

>
> Hi,
>
> On Sat, 16 Mar 2024 at 08:52, Ian Eure ian@retrospec.tv wrote:
>
> > They appear to be using the archive to build LLMs:
> > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
>
>
> About LLM, Software Heritage made a clear statement:
>
> https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
>
> Quoting:
>
> We feel that the question is no longer whether LLMs for code
> should be built. They are already being built, independently of
> what we do, and there is no turning back. The real question is
> how they should be built and whom they should benefit.
>
> Principles:
>
> 1. Knowledge derived from the Software Heritage archive must be
> given back to humanity, rather than monopolized for private
> gain. The resulting machine learning models must be made available
> under a suitable open license, together with the documentation and
> toolings needed to use them.
>
> 2. The initial training data extracted from the Software Heritage
> archive must be fully and precisely identified by, for example,
> publishing the corresponding SWHID identifiers (note that, in the
> context of Software Heritage, public availability of the initial
> training data is a given: anyone can obtain it from the
> archive). This will enable use cases such as: studying biases
> (fairness), verifying if a code of interest was present in the
> training data (transparency), and providing appropriate attribution
> when generated code bears resemblance to training data (credit),
> among others.
>
> 3. Mechanisms should be established, where possible, for authors to
> exclude their archived code from the training inputs before model
> training begins.
>
> I hope it clarifies your concerns to some extent.
>
>
> Moreover, you wrote: « I want absolutely nothing to do with them. »
>
> Maybe there is a misunderstanding on your side about what “free
> software” and GPL means because once “free software”, you cannot prevent
> people to use “your” free software for any purposes you dislike.
>
> If you want to bound the use cases of the software you create, you need
> to explicitly specify that in the license. And if you do, your software
> will not be considered as “free software”.
>
> That’s the double sword of “free software”. :-)

Hi,

I want to stress that I am not a lawyer, but my (possibly outdated) understanding of what machine learning models can and cannot do with regard to their training data, and a reading of parts of the GPL 2 and 3, suggest that at best the SWH's LLM is in a legal grey area and at worst directly violates the license of GPL code that it ingests for training. As such, I don't think it is accurate to say "you cannot prevent people to use “your” free software for any purposes you dislike" in response to concerns about automatic inclusion of free software into LLM training sets. Specifically, my understanding (as of a few years ago) is that LLMs have difficulty tracing and attributing various aspects of their training to specific inputs, which seems to be in violation of e.g. Sections 5 and 6 of the GPL.
Specific quotes from those sections (https://www.gnu.org/licenses/gpl-3.0.html):

From section 5:

> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
>
> a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
> b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
> c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
> d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.

and from Section 6:

> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:
>
> a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange.
> b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge.
> c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b.
> d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements.
> e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d.

And from the GPL 2 text at https://www.gnu.org/licenses/old-licenses/gpl-2.0.html:

> 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
>
> a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.
> b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.
> c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)
>
> These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.
>
> Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.
>
> In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.
>
> 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:
>
> a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
> b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
> c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
>
> The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
>
> If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.
>
> 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

Again, I want to emphasize IANAL. As a layman, my understanding of ML model training is that it cannot maintain enough of a trace between GPLed input code and its (modified) use in the output to maintain the licensing and distribution requirements from either the GPL 3 sections above or the GPL 2 sections 2 and 3.
I also believe that section 4 of the GPL 2 directly applies to these LLM code models.

There are also the potential licensing issues of mixing (potentially) incompatible licenses in the training data sets, such as GPL and CDDL code, with no way to distinguish or separate the (arguably) modified sources from each other.

Just my $0.02 USD on the LLM side of the matter, as much of the discussion seems to be around the cost vs benefit of rewriting the git history for updating personally identifying information.

Cheers,
Kaelyn

>
> Cheers,
> simon