From: Ian Eure
To: Ludovic Courtès
Cc: guix-devel@gnu.org
Subject: Re: Next Steps For the Software Heritage Problem
Date: Thu, 27 Jun 2024 08:30:39 -0700
Message-ID: <87ed8i4btv.fsf@meson>
In-reply-to: <87r0ci7eq1.fsf@gnu.org>

Hi Ludo,

Ludovic Courtès writes:

> Ian Eure skribis:
>
>> Guix sends archive requests to SWH.  SWH gives that source code to
>> HuggingFace.  HuggingFace demonstrably violates the licenses.
>
> Which licenses?  As has been said previously, and you can verify for
> yourself, it does not ingest code under copyleft licenses.
>

While this is what their paper claims[1], it doesn’t appear to be
true, since I can see my own GPL’d code in the training set.  I’ve
since moved nearly all of my code off GitHub, but if you visit
their "Am I in The Stack?"
page[2] and enter my old username ("ieure"), you will see pretty much
every repository I ever hosted there, including both unlicensed and
GPL’d code.  Some examples are hyperspace-el, nssh-el, tl1-mode, etc.
While there aren’t LICENSE files in those repos, the file headers of
all of them clearly indicate that they’re GPL’d.

Unfortunately, there is no way to check for the presence of code in
the training set except by GitHub username.

What I don’t know for certain is whether these are in the training
set because they came from SWH, or because HuggingFace obtained them
through other means.  Given that all the results for my GitHub
username on that "Am I in The Stack?" page link back to SWH, it seems
very likely that they came from SWH.

Thanks,

 -- Ian

[1]: https://arxiv.org/pdf/2402.19173 "We also exclude
     copyleft-licensed code..."
[2]: https://huggingface.co/spaces/bigcode/in-the-stack