Date: Thu, 20 Jun 2024 09:51:17 +0300
From: MSavoritias
To: Simon Tournier
Cc: Ian Eure, guix-devel@gnu.org
Subject: Re: Next Steps For the Software Heritage Problem
Message-ID: <20240620095117.6b3d3b3b@fannys.me>
In-Reply-To: <87plsd9eqq.fsf@gmail.com>
References: <87a5jh74jf.fsf@gmail.com> <20240619121338.71b5f340@fannys.me> <87plsd9eqq.fsf@gmail.com>
List-Id: "Development of GNU Guix and the GNU System distribution."

On Wed, 19 Jun 2024 16:41:33 +0200
Simon Tournier wrote:

> Hi MSavoritias, all,
>
> Let me provide more context.
>
> The concern started a couple of months ago, to my knowledge. And
> discussion is still ongoing.
> So I think that’s incorrect to say “any
> result for over 6 months”.

Hey Simon,

I was talking from the perspective of a Guix person who is not one of the
maintainers and is not on any of the mailing lists where these
discussions are happening. So from my side there haven't been any
updates from SWH or from Guix, either on the named issue or on the LLM
issue.

> Moreover, I feel you have a misunderstanding about the HuggingFace and SWH
> partnership. From the reading of public information, HuggingFace and
> BigCode train on a subset of the SWH source code archive. I mean, it is
> a snapshot and to my knowledge, they provided the list of source code
> that had been used for training.
>
> Not to avoid the question but from a pragmatic point of view, one
> might ask whether the source code you write and do not want to be included
> in the training dataset is concretely part of
> that training dataset.
>
> HuggingFace is not training continuously with source code from SWH.
>
> And technically, SWH is an archive, i.e., the code is not stored hot.
> I do not know and I have not read all details by HuggingFace of their
> method; i.e., which kind of data they process – independent unique
> files, complete repository, etc. What I know is that the piece used when
> fetching from SWH is named SWH Vault; it requires to “cook” and
> prepare all the files, which takes time, from minutes to days.

That's all fair and valid. Sadly, though, SWH:

- Doesn't mention anywhere on their website what happens to my code
  and where, so there is no provenance (unless I start searching
  HuggingFace).
- The email the director sent me says explicitly that they don't see
  an issue with it being opt-out after the fact, and that they embrace
  LLM usage.

So it seems to me that it's already in there.

> All that to say two key points:
>
> 1. People behind SWH are well aware of the various sides of the
> concerns.
> As said, they are long-time free software supporters. Be
> sure they have heard community concerns. Some discussions are still
> pending because, as explained, all sides of ethical questions need to
> be handled with caution.
>
> Please do not think it is ignored.
>
>
> 2. FWIW, I am in touch with SWH people – among other members from the Guix
> community. For instance, in order to feed the discussion, Roberto
> from SWH pointed me to this blog post by Bruce Perens:
>
> https://perens.com/2019/10/12/invasion-of-the-ethical-licenses/
>
> Well, I do not know if the outcome will be aligned with your current
> opinion, but be sure that your concerns, like the others raised by Guix
> community members, are taken into account.

Thank you for giving me an honest and detailed answer.
I wish I could say this was encouraging, but as things currently stand I
would like much more transparency from Guix and SWH about what is
actually happening. Because currently:

- The director seemed completely oblivious to any issues with LLMs or
  code harvesting without consent.
- Efraim seemed to suggest that there hasn't been any communication
  and that it's even off-topic.
- Nothing has been written publicly by Guix or SWH about it, and there
  are no mechanisms in place, even in the short term, to mitigate some
  of these things. (My next steps try to fix that when I make the
  patches in a few weeks.)

Regards,

MSavoritias

> Cheers,
> simon