References: <87il1mupco.fsf@meson> <87a5mvyjl4.fsf@gmail.com>
User-agent: mu4e 1.8.13; emacs 28.2
From: Ian Eure
To: Simon Tournier
Cc: guix-devel
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 12:38:18 -0700
In-reply-to: <87a5mvyjl4.fsf@gmail.com>
Message-ID: <87ttl3thvh.fsf@meson>

Simon Tournier writes:

> Hi,
>
> On Sat., 16 Mar 2024 at 08:52, Ian Eure wrote:
>
>> They appear to be using the archive to build LLMs:
>> https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
>
> About LLM, Software Heritage made a clear statement:
>
>     https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
>
> Quoting:
>
>     We feel that the question is no longer whether LLMs for code
>     should be built.
>     They are already being built, independently of what we do, and
>     there is no turning back. The real question is how they should be
>     built and whom they should benefit.
>
>     Principles:
>
>     1. Knowledge derived from the Software Heritage archive must be
>        given back to humanity, rather than monopolized for private
>        gain. The resulting machine learning models must be made
>        available under a suitable open license, together with the
>        documentation and toolings needed to use them.
>
>     2. The initial training data extracted from the Software Heritage
>        archive must be fully and precisely identified by, for example,
>        publishing the corresponding SWHID identifiers (note that, in
>        the context of Software Heritage, public availability of the
>        initial training data is a given: anyone can obtain it from the
>        archive). This will enable use cases such as: studying biases
>        (fairness), verifying if a code of interest was present in the
>        training data (transparency), and providing appropriate
>        attribution when generated code bears resemblance to training
>        data (credit), among others.
>
>     3. Mechanisms should be established, where possible, for authors
>        to exclude their archived code from the training inputs before
>        model training begins.
>
> I hope it clarifies your concerns to some extent.

It doesn’t clarify them, but it does illustrate them.

HuggingFace and the StarCoder2 model are in violation of principle 2.
By their own admission, they are including code without clear
licensing[1]:

    The main difference between the Stack v2 and the Stack v1 is that
    we include both permissively licensed and unlicensed files.

HuggingFace’s StarChat2 Playground[2] also violates this principle, as
it outputs code without any license or provenance information; I know,
because I tried it.
While their own terms of use for StarCoder2 state:

    Any use of all or part of the code gathered in The Stack v2 must
    abide by the terms of the original licenses...

...their own playground makes this impossible.

HuggingFace is also in violation of the third principle, because they
haven’t established a functioning opt-out model[3]. Opting out
requires using non-free software; requests have been sitting for
nearly a year with no action or response; and out of every request
submitted, only a single one has *ever* been honored.

They appear to be violating free software licenses on a large scale.
They are in violation of SWH’s own positions.

> Moreover, you wrote: « I want absolutely nothing to do with them. »
>
> Maybe there is a misunderstanding on your side about what “free
> software” and GPL mean, because once software is “free software”,
> you cannot prevent people from using “your” free software for any
> purposes you dislike.
>
> If you want to bound the use cases of the software you create, you
> need to explicitly specify that in the license. And if you do, your
> software will not be considered as “free software”.
>
> That’s the double-edged sword of “free software”. :-)

I am crystal clear on the meaning of free software. I wish to remove
it from these models *in order to* keep it free.

Thanks,

-- Ian

[1]: https://arxiv.org/html/2402.19173v1
[2]: https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground
[3]: https://huggingface.co/datasets/bigcode/the-stack-v2
[4]: https://github.com/bigcode-project/opt-out-v2/issues