all messages for Guix-related lists mirrored at yhetil.org
From: Ian Eure <ian@retrospec.tv>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: guix-devel@gnu.org
Subject: Re: Next Steps For the Software Heritage Problem
Date: Thu, 27 Jun 2024 08:30:39 -0700
Message-ID: <87ed8i4btv.fsf@meson>
In-Reply-To: <87r0ci7eq1.fsf@gnu.org>

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Ian Eure <ian@retrospec.tv> skribis:
>
>> Guix sends archive requests to SWH.  SWH gives that source code to
>> HuggingFace.  HuggingFace demonstrably violates the licenses.
>
> Which licenses?  As has been said previously, and you can verify for
> yourself, it does not ingest code under copyleft licenses.
>

While this is what their paper claims[1], it doesn’t appear to be 
true, since I can see my own GPL’d code in the training set.  I’ve 
since moved nearly all of my code off GitHub, but if you visit 
their "Am I in The Stack?" page[2] and enter my old username 
("ieure"), you will see pretty much every repository I ever hosted 
there, including both unlicensed and GPL’d code.  Some examples 
are hyperspace-el, nssh-el, tl1-mode, etc.  While there aren’t 
LICENSE files in those repos, the file headers of all clearly 
indicate that they’re GPL’d.

Unfortunately, there is no way to check for the presence of code 
in the training set except by GitHub username.

What I don’t know for certain is whether these are in the training 
set because they came from SWH, or because HuggingFace obtained 
them through other means.  Given that all the links for my GitHub
username on that "Am I in The Stack?" page point back to SWH, it
seems very likely that they came from SWH.

Thanks,

  — Ian

[1]: https://arxiv.org/pdf/2402.19173 "We also exclude 
copyleft-licensed code..."
[2]: https://huggingface.co/spaces/bigcode/in-the-stack


