From: Jim Porter
Newsgroups: gmane.emacs.devel
Subject: Re: [NonGNU ELPA] New package: llm
Date: Sun, 20 Aug 2023 23:03:30 -0700
Message-ID: <8c8d1109-d6f3-70b5-010b-31042b5baa18@gmail.com>
References: <54c21d90-8bd6-8723-9e33-d69179b37bd0@gmail.com> <705ab838-142a-b3cc-8cc8-6f4d143c4341@gmail.com>
To: Andrew Hyatt
Cc: Daniel Fleischer, Richard Stallman, emacs-devel@gnu.org
Xref: news.gmane.io gmane.emacs.devel:309025

On 8/20/2023 10:12 PM, Andrew Hyatt wrote:
> The training of these is fairly straightforward, at least if you are
> familiar with the area. ... the LLM we are talking about here use this
> technique to train and execute, changing some parameters and adding
> things like more attention heads, but keeping the fundamental
> architecture the same.

I think the parameters would be a key part of this (or potentially all
of the code they used for the training, if it does something unique), as
well as the *actual* training datasets. That's why I'm especially
concerned about the line in their docs saying "great efforts have been
taken to clean the pretraining data". I couldn't find out whether they
provided the cleaned data or only the "raw" data.
From my understanding, properly cleaning the data is labor-intensive,
and you wouldn't be able to reproduce another team's efforts in that
area unless they gave you a diff or something equivalent.

> I'm not an expert, but I believe that due to the use of stochastic
> processes in training, even if you had the exact code, parameters and
> data used in training, you would never be able to reproduce the model
> they make available. It should be equivalent in quality, perhaps, but
> not the same.

This is a problem for reproducibility (it would be nice if you could
*verify* that a model was built the way its makers said it was), but I
don't think it's a critical problem for freedom.

> To me, I believe it should be about freedom. Not absolute freedom, but
> relative freedom: do you, the user, have the same amount of freedom as
> anyone else, including the creator? For the LLMs like huggingface and
> many other research LLMs, the answer is yes.

So long as the creators provide all the necessary parameters to retrain
the model from "scratch", I think I'd agree. If some of these aren't
provided (cleaned datasets, training parameters, any direct human
intervention if applicable, etc), then I think the answer is no. For
example, the creator could decide that one data source is bad for some
reason, and retrain their model without it. Would I be able to do that
work independently with just what the creator has given me?

I see that there was a presentation at LibrePlanet 2023 (or maybe
shortly after) by Leandro von Werra of HuggingFace on the ethics of
code-generating LLMs[1]. It says that it hasn't been published online
yet, though. This might not be the final answer on all the concerns
about incorporating LLMs into Emacs, but hopefully it would help.

In practice though, I think if Emacs were to support communicating with
LLMs, it would be good if - at minimum - we could direct users to an
essay explaining the potential ethical/freedom issues with them.

On that note, maybe we could also take a bit of inspiration from Emacs
dynamic modules. They require a GPL compatibility symbol[2] in order to
load, and perhaps a hypothetical 'llm-foobar' package that interfaces
with the 'foobar' LLM could announce whether it respects users' freedom
via some variable/symbol. Freedom-respecting LLMs wouldn't need a
warning message then. We could even forbid packages that talk to
particularly "bad" LLMs. (I suppose we can't stop users from writing
their own packages and just lying about whether they're ok, but we could
prevent their inclusion in ELPA.)

[1] https://www.fsf.org/bulletin/2023/spring/trademarks-volunteering-and-code-generating-llm

[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Initialization.html
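
To make the symbol idea above a little more concrete, here is a rough, language-neutral sketch of the check, written in Python rather than Emacs Lisp. Everything in it is hypothetical: the `Provider` record, the `respects_freedom` flag, the warning text, and the archive policy function are illustrative names only, not part of any existing package or of Emacs itself. The one borrowed idea is that a missing declaration is treated like a failed one, in the same spirit as dynamic modules that lack the GPL compatibility symbol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    """A hypothetical LLM backend package, e.g. 'llm-foobar'."""
    name: str
    # Hypothetical declaration, loosely analogous to the GPL compatibility
    # symbol of dynamic modules. None means no declaration was made.
    respects_freedom: Optional[bool] = None


# Hypothetical wording; a real message would point at the actual essay.
WARNING = "{name}: this LLM may not respect your freedom; see the essay."


def check_provider(provider: Provider) -> Optional[str]:
    """Return a warning to show the user, or None if none is needed.

    A missing declaration is handled like an explicit False.
    """
    if provider.respects_freedom is True:
        return None
    return WARNING.format(name=provider.name)


def may_include_in_archive(provider: Provider, banned: frozenset) -> bool:
    """Hypothetical archive policy: reject packages for banned LLMs."""
    return provider.name not in banned
```

As noted above, a check like this cannot stop a package from simply declaring itself fine; it only gives the archive maintainers one place to enforce the policy.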