From: Simon Tournier
To: MSavoritias
Cc: Ian Eure, guix-devel@gnu.org
Subject: Re: Next Steps For the Software Heritage Problem
In-Reply-To: <20240620095117.6b3d3b3b@fannys.me>
References: <87a5jh74jf.fsf@gmail.com> <20240619121338.71b5f340@fannys.me>
 <87plsd9eqq.fsf@gmail.com> <20240620095117.6b3d3b3b@fannys.me>
Date: Thu, 20 Jun 2024 16:40:57 +0200
Message-ID: <871q4roex2.fsf@gmail.com>
List-Id: "Development of GNU Guix and the GNU System distribution."

Hi MSavoritias, all,

On Thu, 20 Jun 2024 at 09:51, MSavoritias wrote:

>> Not to avoid the question but from a pragmatic point of view, one
>> might ask if the source code you write and do not want to be included
>> in the training dataset, if this source code is concretely part of
>> that training dataset.

[...]

> Thats all fair and valid. Sadly tho SWH:
> -
> there is provenance. (unless i start searching
> HuggingFace.

Being concrete and explicit, could you please share:

 1. Which part of your code is included in the pretraining dataset?
    It’s easy: you can copy/paste a snippet and it returns the location
    it comes from.

    https://huggingface.co/spaces/bigcode/search-v2a

 2. What is your code that is included in the SWH archive?

    Again, it’s easy: check out some commit of your repository, then,
    inside this repository, you can run:

        echo "https://archive.softwareheritage.org/swh:1:dir:$(guix hash -S git -f hex -H sha1 .)"

    Do not miss the ’.’ (dot) once you have entered the repository.  This
    command returns the SWHID (as an archive URL).  In other words, using
    this identifier, you can find out whether the repository is stored by
    SWH.  (Be careful with temporary artifacts such as .go files.)

    Or you can also check for one specific content:

        $ echo "https://archive.softwareheritage.org/swh:1:cnt:$(guix hash -S git -f hex -H sha1 COPYING)"
        https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    And the URL displays the content of the file COPYING; here, the GPL 3
    license, for instance.

 3. Where is such source code from #1 and #2 packaged by Guix?

That said, if the source is hosted on GitHub or GitLab.com or SourceHut
or Codeberg or some other popular forge, or even mirrored without your
consent on one of these, please consider that your code has been
ingested by ChatGPT, without any means to verify it.  Obviously, that’s
not an argument to accept the situation with HuggingFace, and I
understand that you do not want your publicly released copyleft source
code to be reused by any LLM.  However, as said several times, this wish
for non-inclusion is a question larger than your own wishes once you
have publicly released such source code under some copyleft license.  I
hope we agree on that.

Again, I am not trying to avoid something.  And again, we all have heard
your points.  Nothing is ignored.

To my knowledge, the path forward is not yet well-defined.
Since we are discussing at length with various different inputs, it
means that a common understanding and/or opinion does not seem obvious.

>> Well, I do not know if the outcome will be aligned with your current
>> opinion, but be sure that your concerns as the others raised by Guix
>> community members are taking into account.
>
> Thank you for giving me an honest and detailed answer.

I feel you are pushy on the topic and, for what my opinion is worth, it
is not helpful to raise again and again that you want a way to opt out.
Yeah, people got it. :-)  And you are probably not alone, I guess.

It would help if you could provide some source code that you wrote and
answer the three criteria above: included in the pretraining dataset,
included in SWH, packaged by Guix.

I do not have special information from SWH but I am sure SWH people are
working on the topic.  And again, maybe the outcome will not be aligned
with your opinion.  Another story.

Now, the other question you ask Guix: do we continue to help SWH in
harvesting?  You propose to stop, IIUC.  OK, we got it, too. :-)

From my point of view, the path forward is not to speak in the abstract
but to rely on concrete numbers; it would help in bounding what we are
speaking about.  Concretely, if you would like to be able to opt out,
could you point to:

 1. the piece of the Guix source code that you authored?
 2. source code you authored that is packaged by Guix?

Again, I am not trying to avoid the discussion.  Instead, I would prefer
to ground the discussion in concrete examples.  Then it would appear to
me easier to make progress.

As Greg or Ekaitz also wrote: opting out has implications on the meaning
of freedom behind “free software”.  IMHO, it is not because we would
like to opt out that we could, would be able to, or would be allowed to.

Therefore, instead of holding opinions in the abstract, let’s try to
make progress and start with the concrete: which piece of source code
are we speaking about?
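
As a side note on point 2 above: the content identifier (swh:1:cnt:) can
also be reproduced without Guix, because it is the same hash Git computes
for a blob.  The following is only a minimal sketch, assuming sha1sum
from GNU coreutils is installed; swhid_cnt is a hypothetical helper name
made up for illustration, not an SWH or Guix command.

```shell
#!/bin/sh
# Sketch: compute a swh:1:cnt: identifier for one file without Guix.
# Assumption: sha1sum (GNU coreutils) is available on PATH.
# swhid_cnt is a hypothetical helper name, not an SWH or Guix command.
swhid_cnt() {
    file=$1
    # Byte length of the file; tr strips padding that some wc versions add.
    size=$(wc -c < "$file" | tr -d ' ')
    # Git blob hashing: SHA-1 over "blob <size>", a NUL byte, then the bytes.
    hash=$( { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d ' ' -f 1 )
    printf 'swh:1:cnt:%s\n' "$hash"
}
```

After sourcing this, running swhid_cnt COPYING inside a checkout prints
the identifier; prefixing it with https://archive.softwareheritage.org/
gives a URL of the same form as the one in point 2.  Directory
identifiers (swh:1:dir:) follow Git tree hashing and are not covered by
this sketch.
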

Cheers,
simon