On 2024-03-16 12:06:27 -0700, Ian Eure wrote:
>
> Christopher Baines writes:
>
> > [[PGP Signed Part:Undecided]]
> >
> > Ian Eure writes:
> >
> > > Hi Guixy people,
> > >
> > > I’d never heard of SWH before I started hacking on Guix last fall,
> > > and it struck me as rather a good idea. However, I’ve seen some
> > > things lately which have soured me on them.
> > >
> > > They appear to be using the archive to build LLMs:
> > > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
> > >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> > >
> > > GPL’d software I’ve created has been packaged for Guix, which I
> > > assume means it’s been included in SWH. While I’m dealing with
> > > their (IMO: unethical) opt-out process, I likely also need to stop
> > > new copies from being uploaded again in the future.
> > >
> > > Is there a way to indicate, in a Guix package, that it should
> > > *never* be included in SWH?
> >
> > Not currently, and I don't really see the point in such a mechanism. If
> > you really never want them to store your code, then you need to license
> > it accordingly (and not make it free software).
> >
>
> I don’t want my code in SWH *because* it’s free. A primary use of LLMs is
> laundering freely licensed software into proprietary, commercial projects
> through "AI" code completion and generation. Any Free software in an LLM
> training set can and will be used in violation of its license, without a
> clear path for the author to seek recourse. I deleted my code off Github
> and abandoned it completely for this exact reason, and am deeply irked to be
> going through this nonsense again.
>
> A more salient question may be: Is there a process within Guix (either the
> program or the organization) which uploads source to SWH? Or does it rely
> on SWH independently?

`guix lint PKG-NAME' schedules SWH archival if possible. No code is
directly uploaded (at least currently), so assuming you have an IP list
for SWH, it should be possible to block it. At least AFAIK.

If you have the list, or know how to get it, could you share it? I would
be interested in blocking it as well from my git hosting.

>
> If the latter, my problem is likely solved by blocking SWH at my network
> edge and opting out of their archive (or trying to) and the downstream
> training models they’ve already put it in. If the former, the only control
> I currently have to protect my license is removing packages from Guix which
> contain it. I don’t want that outcome.
>
> Noting also that the path here seems to be SWH->huggingface->bigcode
> training set, and the opt-out process for the training set appears to be a
> complete sham. To opt-out, you must create a Github Issue; only one opt-out
> has *ever* been processed, and there are 200+ sitting there, many with no
> response for nearly a year[1]. I want no part of any of this.
>
> > > Is there a way to tell Guix to never download source from SWH?
> >
> > Also no, and it's probably best to do this at the network level on your
> > systems/network if you want this to be the case.
> >
>
> I’ll investigate this, though I’d prefer if there was a way to configure
> source mirrors in the Guix daemon.
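
To illustrate the network-level blocking discussed above: a minimal
Python sketch, assuming a list of crawler address ranges can be obtained
somehow. The ranges below are reserved documentation prefixes, not real
SWH addresses, and all names (`BLOCKED_RANGES`, `parse_ranges`,
`is_blocked`) are made up for illustration.

```python
import ipaddress

# ASSUMPTION: placeholder ranges taken from the reserved documentation
# prefixes (RFC 5737 and RFC 3849). These are NOT real SWH addresses;
# substitute an actual list if one can be obtained.
BLOCKED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def parse_ranges(ranges):
    """Turn CIDR strings into network objects, skipping blank entries."""
    return [ipaddress.ip_network(r.strip()) for r in ranges if r.strip()]


def is_blocked(client_ip, networks):
    """Return True when client_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(client_ip)
    # Compare only against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in networks)


if __name__ == "__main__":
    nets = parse_ranges(BLOCKED_RANGES)
    for ip in ("192.0.2.17", "198.51.100.5", "2001:db8::1"):
        print(ip, "blocked" if is_blocked(ip, nets) else "allowed")
```

The same list could instead be fed to a firewall or to a web server's
deny rules; only the membership test is shown here.
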
> >
> > Skipping back to this though:
> >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> >
> > This is probably worth thinking about as Guix is in a similar situation
> > regarding publishing source code, and people potentially wanting to
> > change historical source code both in things Guix packages and Guix
> > itself.
> >
> > Like Software Heritage, there's cryptographical implications for
> > rewriting the Git history and modifying source tarballs or nars that
> > contain source code.
> >
> > We have 17TiB of compressed source code and built software stored for
> > bordeaux.guix.gnu.org now and we should probably work out how to handle
> > people asking for things to be removed or changed (for any and all
> > reasons).
> >
> > It's probably worth working out our position on this in advance of
> > someone asking.
> >
>
> Yes, I agree that Guix needs a better solution for this.
>
> Thanks,
>
> — Ian
>
> [1]: https://github.com/bigcode-project/opt-out-v2/issues
>

T.

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.