From: Ian Eure <ian@retrospec.tv>
To: Simon Tournier <zimon.toutoune@gmail.com>
Cc: guix-devel <guix-devel@gnu.org>
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 12:38:18 -0700
Message-ID: <87ttl3thvh.fsf@meson>
In-Reply-To: <87a5mvyjl4.fsf@gmail.com>
Simon Tournier <zimon.toutoune@gmail.com> writes:
> Hi,
>
> On Sat, 16 Mar 2024 at 08:52, Ian Eure <ian@retrospec.tv> wrote:
>
>> They appear to be using the archive to build LLMs:
>> https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
>
> About LLM, Software Heritage made a clear statement:
>
> https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
>
> Quoting:
>
> We feel that the question is no longer whether LLMs for code should
> be built. They are already being built, independently of what we do,
> and there is no turning back. The real question is how they should be
> built and whom they should benefit.
>
> Principles:
>
> 1. Knowledge derived from the Software Heritage archive must be given
>    back to humanity, rather than monopolized for private gain. The
>    resulting machine learning models must be made available under a
>    suitable open license, together with the documentation and
>    toolings needed to use them.
>
> 2. The initial training data extracted from the Software Heritage
>    archive must be fully and precisely identified by, for example,
>    publishing the corresponding SWHID identifiers (note that, in the
>    context of Software Heritage, public availability of the initial
>    training data is a given: anyone can obtain it from the archive).
>    This will enable use cases such as: studying biases (fairness),
>    verifying if a code of interest was present in the training data
>    (transparency), and providing appropriate attribution when
>    generated code bears resemblance to training data (credit), among
>    others.
>
> 3. Mechanisms should be established, where possible, for authors to
>    exclude their archived code from the training inputs before model
>    training begins.
>
> I hope it clarifies your concerns to some extent.
>
It doesn’t clarify them, but it does illustrate them.
HuggingFace and the StarCoder2 model are in violation of principle
2. By their own admission, they are including code without clear
licensing[1]:
  The main difference between the Stack v2 and the Stack v1 is that we
  include both permissively licensed and unlicensed files.
HuggingFace’s StarChat2 Playground[2] also violates this
principle, as it outputs code without any license or provenance
information; I know, because I tried it. While their own terms of
use for StarCoder2 state[3]:
  Any use of all or part of the code gathered in The Stack v2 must
  abide by the terms of the original licenses...
...their own playground makes this impossible.
HuggingFace is also in violation of the third principle, because
they haven’t established a functioning opt-out model[4]. Opting
out requires using non-free software; requests have been sitting
for nearly a year with no action or response; and out of every
request submitted, only a single one has *ever* been honored.
They appear to be violating free software licenses on a large scale.
They are in violation of SWH’s own positions.
> Moreover, you wrote: « I want absolutely nothing to do with
> them. »
>
> Maybe there is a misunderstanding on your side about what “free
> software” and GPL means because once “free software”, you cannot
> prevent people to use “your” free software for any purposes you
> dislike.
>
> If you want to bound the use cases of the software you create, you
> need to explicitly specify that in the license. And if you do, your
> software will not be considered as “free software”.
>
> That’s the double sword of “free software”. :-)
>
I am crystal clear on the meaning of free software. I wish to
remove it from these models *in order to* keep it free.
Thanks,
— Ian
[1]: https://arxiv.org/html/2402.19173v1
[2]: https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground
[3]: https://huggingface.co/datasets/bigcode/the-stack-v2
[4]: https://github.com/bigcode-project/opt-out-v2/issues