From: Simon Tournier
To: Ian Eure, guix-devel
Subject: Re: Concerns/questions around Software Heritage Archive
In-Reply-To: <87il1mupco.fsf@meson>
References: <87il1mupco.fsf@meson>
Date: Mon, 18 Mar 2024 10:28:55 +0100
Message-ID: <87a5mvyjl4.fsf@gmail.com>
List-Id: "Development of GNU Guix and the GNU System distribution."

Hi,

On Sat, 16 Mar 2024 at 08:52, Ian Eure wrote:

> They appear to be using the archive to build LLMs:
> https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/

About LLMs, Software Heritage made a clear statement:

    https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code

Quoting:

    We feel that the question is no longer whether LLMs for code
    should be built.
    They are already being built, independently of what we do, and
    there is no turning back. The real question is how they should be
    built and whom they should benefit.

    Principles:

    1. Knowledge derived from the Software Heritage archive must be
       given back to humanity, rather than monopolized for private
       gain. The resulting machine learning models must be made
       available under a suitable open license, together with the
       documentation and toolings needed to use them.

    2. The initial training data extracted from the Software Heritage
       archive must be fully and precisely identified by, for example,
       publishing the corresponding SWHID identifiers (note that, in
       the context of Software Heritage, public availability of the
       initial training data is a given: anyone can obtain it from the
       archive). This will enable use cases such as: studying biases
       (fairness), verifying if a code of interest was present in the
       training data (transparency), and providing appropriate
       attribution when generated code bears resemblance to training
       data (credit), among others.

    3. Mechanisms should be established, where possible, for authors
       to exclude their archived code from the training inputs before
       model training begins.

I hope this addresses your concerns to some extent.

Moreover, you wrote: « I want absolutely nothing to do with them. »

Maybe there is a misunderstanding on your side about what “free
software” and the GPL mean: once software is “free software”, you
cannot prevent people from using “your” free software for purposes you
dislike. If you want to restrict the use cases of the software you
create, you need to specify that explicitly in the license. And if you
do, your software will no longer be considered “free software”.

That’s the double-edged sword of “free software”. :-)

Cheers,
simon