From: Andrew Hyatt <ahyatt@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: LLM Experiments, Part 1: Corrections
Date: Wed, 24 Jan 2024 10:55:14 -0400
To: contact@karthinks.com
Cc: emacs-devel@gnu.org, sskostyaev@gmail.com
References: <87plxrr06f.fsf@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
In-Reply-To: <87plxrr06f.fsf@gmail.com> (contact@karthinks.com's message of
 "Tue, 23 Jan 2024 17:26:48 -0800")

Thanks for this super useful response. BTW, I'm going to try again to use
gnus to respond, after making some changes, so apologies if the formatting
goes awry. If it does, I'll re-respond in gmail.

On Tue, Jan 23, 2024 at 05:26 PM contact@karthinks.com wrote:

> Hi Andrew,
>
> Having worked on similar problems in gptel for about nine months now
> (without much success), here are some thoughts.
>
>> Question 1: Does the llm-flows.el file really belong in the llm
>> package? It does help people code against llms, but it expands
>> the scope of the llm package from being just about connecting to
>> different LLMs to offering a higher level layer necessary for
>> these more complicated flows. I think this probably does make
>> sense; there's no need to have a separate package just for this
>> one part.
>
> I think llm-flows works better as an independent package for now,
> since it's not clear yet what the "primitive" operations of working
> with LLMs look like. I suspect these usage patterns will also be a
> matter of preference for users, as I've realized from comparing how I
> use gptel to the feature requests I get.

I agree that the more we assume specific patterns (which indeed seems to
be where things are going), the more it should go into its own package
(or maybe be part of ellama or something). But there may be some
commonalities that are just useful regardless of the usage patterns. I
think we'll have to see how this plays out.

>> Question 2: What's the best way to write these flows with multiple
>> stages, in which some stages sometimes need to be repeated? It's
>> kind of a state machine when you think about it, and there's a
>> state machine GNU ELPA library already (fsm). I opted to not model
>> it explicitly as a state machine, optimizing instead to just use
>> the most straightforward code possible.
>
> An FSM might be overkill here. At the same time, I'm not sure that
> all possible interactions will fit this multi-step paradigm like
> rewriting text does.
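To make the comparison concrete, the non-FSM style I have in mind is just
plain chained callbacks, roughly like the sketch below. This is
illustrative only, not the actual llm-flows.el code; it assumes the llm
package's llm-chat-async and llm-make-simple-chat-prompt, and the command
name is made up.

;; Illustrative only: a two-stage "rewrite, then confirm or retry" flow
;; written as plain chained callbacks rather than an explicit FSM.  It
;; assumes the llm package's `llm-chat-async' and
;; `llm-make-simple-chat-prompt'; the command name is made up.
(require 'llm)

(defun my-correct-region-flow (provider beg end)
  "Ask PROVIDER to correct the text between BEG and END, then confirm."
  (let ((text (buffer-substring-no-properties beg end))
        (buf (current-buffer)))
    (llm-chat-async
     provider
     (llm-make-simple-chat-prompt
      (concat "Correct any errors in the following text and return only "
              "the corrected text:\n\n" text))
     (lambda (response)
       (with-current-buffer buf
         ;; Second stage: accept, retry, or give up.
         (pcase (read-char-choice
                 (format "Rewrite:\n%s\n[a]ccept, [r]etry, [q]uit? " response)
                 '(?a ?r ?q))
           (?a (delete-region beg end)
               (goto-char beg)
               (insert response))
           (?r (my-correct-region-flow provider beg end))
           (?q nil))))
     (lambda (_type msg) (message "LLM error: %s" msg)))))

Whether that stays readable once there are more stages and retry paths is
exactly the open question.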
>> Question 3: How should we deal with context? The code that has the
>> text corrector doesn't include surrounding context (the text
>> before and after the text to rewrite), but it usually is helpful.
>> How much context should we add? The llm package does know about
>> model token limits, but more tokens add more cost in terms of
>> actual money (per-token billing for services, or just the CPU
>> energy costs for local models). Having it be customizable makes
>> sense to some extent, but users are not expected to have a good
>> sense of how much context to include. My guess is that we should
>> just have a small amount of context that won't be a problem for
>> most models. But there are other questions as well when you think
>> about context generally: How would context work in different
>> modes? What about when context may spread across multiple files?
>> It's a problem that I don't have any good insight into yet.
>
> I see different questions here:
>
> 1. What should be the default amount of context included with
> requests?
> 2. How should this context be determined? (single buffer, across
> files etc)
> 3. How should this be different between modes of usage, and how
> should this be communicated unambiguously?
> 4. Should we track token costs (when applicable) and communicate them
> to the user?
>
> Some lessons from gptel, which focuses mostly on a chat interface:
>
> 1. Users seem to understand gptel's model intuitively since they
> think of it like a chat application, where the context is expected to
> be everything in the buffer up to the cursor position. The only
> addition is to use the region contents instead when the region is
> active. This default works well for more than chat, actually. It's
> good enough when rewriting-in-place or for continuing your prose/code.

I think it may still be useful to use the context even when modifying the
region, though.

> 2. This is tricky, I don't have any workable ideas yet. In gptel
> I've experimented with providing the options "full buffer" and "open
> project buffers" in addition to the default, but these are both
> overkill, expensive and rarely useful -- they often confuse the LLM
> more than they help. Additionally, in Org mode documents I've
> experimented with using sparse trees as the context -- this is
> inexpensive and can work very well but the document has to be
> (semantically) structured a certain way. This becomes obvious after a
> couple of sessions, but the behavior has to be learned nevertheless.
>
> 3a. For coding projects I think it might be possible to construct a
> "sparse tree" with LSP or via treesitter, and send (essentially) an
> "API reference" along with smaller chunks of code. This should make
> basic copilot-style usage viable. I don't use LSP or treesitter
> seriously, so I don't know how to do this.
>
> 3b. Communicating this unambiguously to users is a UI design question,
> and I can imagine many ways to do it.

Thanks, this is very useful. My next demo is with org-mode, and there I'm
currently sending the tree (just the structure) as context.

There's a meta-question here, which is that we don't know what the right
thing to do is, because we don't have a quality process. If we were doing
this seriously, we'd want a corpus of examples to test with, and a way to
judge quality (LLMs can actually do this as well). The fact that people
could be running any model is a significant complicating factor. But as it
is, we just have our own anecdotal evidence and reasonable hunches. As
these things get better, I think this will converge on what we would
consider reasonable context anyway.

> 4. I think optionally showing the cumulative token count for a
> "session" (however defined) makes sense.

The token count we show would only apply to the current LLM call, though.
My understanding is that the machinery around the LLMs is adding the
previous conversation, in whole or in summary, as context. So it's really
hard to understand the actual token usage once you start having
conversations. That's fine, though, since most tokens are used in the
first message where context appears, and subsequent rounds are fairly
light.
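If we do show a number, even a crude per-request estimate would probably
give users a feel for the cost. Something like the naive sketch below; the
variable and function names are mine, not part of the llm package, and the
roughly-four-characters-per-token figure is only a heuristic.

;; A naive sketch of per-session token accounting.  The names are
;; hypothetical, and the ~4 characters per token figure is only a rough
;; approximation.
(require 'cl-lib)

(defvar my-llm-session-tokens 0
  "Approximate number of tokens sent to the LLM in this session.")

(defun my-llm-estimate-tokens (text)
  "Return a rough token estimate for TEXT (about 4 characters per token)."
  (max 1 (/ (length text) 4)))

(defun my-llm-note-usage (prompt-text)
  "Add PROMPT-TEXT's estimated tokens to the session total and report it."
  (cl-incf my-llm-session-tokens (my-llm-estimate-tokens prompt-text))
  (message "LLM session: roughly %d tokens sent so far"
           my-llm-session-tokens))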
>> Question 5: Should there be a standard set of user behaviors about
>> editing the prompt? In another demo (one I'll send as a followup),
>> with a universal argument, the user can edit the prompt, minus
>> context and content (in this case the content is the text to
>> correct). Maybe that should always be the case. However, that
>> prompt can be long, perhaps a bit long for the minibuffer. Using a
>> buffer instead seems like it would complicate the flow. Also, if
>> the context and content are embedded in that prompt, they would
>> have to be replaced with some placeholder. I think the prompt
>> should always be editable, so we should have some templating system.
>> Perhaps Emacs already has some templating system, and one that can
>> pass arguments for the number of tokens from context would be nice.
>
> Another unsolved problem in gptel right now. Here's what it uses
> currently:
>
> - prompt: from the minibuffer
> - context and content: selected region only
>
> The main problem with including context separate from the content here
> is actually not the UI, it's convincing the LLM to consistently
> rewrite only the content and use the context as context. Using the
> prompt+context as the "system message" works, but not all LLM APIs
> provide a "system message" field.

Yes, the llm library already separates out the "system message" (which we
call context, which I now realize is a bit confusing), and each provider
just deals with it in the best way possible. Hopefully as LLM instruction
following gets better, it will stop treating context as anything other
than context. In the end, IIUC, it all ends up in the same place when
feeding text to the LLM anyway.

>> Question 6: How do we avoid having a ton of very specific
>> functions for all the various ways that LLMs can be used? Besides
>> correcting text, I could have had it expand it, summarize it,
>> translate it, etc. Ellama offers all these things (but without the
>> diff and other workflow-y aspects). I think these are too much for
>> the user to remember.
>
> Yes, this was the original reason I wrote gptel -- the first few
> packages for LLM interaction (only GPT-3.5 back then) wanted to
> prescribe the means of interaction via dedicated commands, which I
> thought overwhelmed the user while also missing what makes LLMs
> different from (say) language checkers like proselint and vale, and
> from code refactoring tools.
>
>> It'd be nice to have one function when the
>> user wants to do something, and we work out what to do in the
>> workflow. But the user shouldn't be developing the prompt
>> themselves; at least at this point, it's kind of hard to just
>> think of everything you need to think of in a good prompt. They
>> need to be developed, updated, etc. What might be good is a system
>> in which the user chooses what they want to do to a region as a
>> secondary input, kind of like another kind of
>> execute-extended-command.
>
> I think having users type out their intention in natural language into
> a prompt is fine -- the prompt can then be saved and added to a
> persistent collection. We will never be able to cover
> (programmatically) even a reasonable fraction of the things the user
> might want to do.

Agreed. Increasingly, it seems like a really advanced prompt management
system might be necessary. How to do that is its own separate set of
questions. How should prompts be stored? Are these variables? Files? Data
in sqlite?
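To make one of those options concrete, the simplest version is probably
just a named collection plus completing-read, which is already close to
the "execute-extended-command for prompts" idea above. A toy sketch; all
the names here are hypothetical, and a real system might well want files
or sqlite instead.

;; A toy sketch of a persistent prompt collection: an alist of named
;; prompt templates chosen with `completing-read'.  All names are
;; hypothetical; a real design might use files or sqlite instead.
(defcustom my-llm-saved-prompts
  '(("correct"   . "Correct any spelling and grammar errors in the text.")
    ("summarize" . "Summarize the text in one short paragraph."))
  "Alist of (NAME . PROMPT-TEXT) pairs for saved LLM prompts."
  :type '(alist :key-type string :value-type string)
  :group 'applications)

(defun my-llm-choose-prompt ()
  "Ask for a saved prompt by name and return its text.
If the user types something that is not a saved name, use it verbatim."
  (let* ((name (completing-read "LLM action: " my-llm-saved-prompts))
         (saved (assoc name my-llm-saved-prompts)))
    (if saved (cdr saved) name)))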
> The things the user might need help with are what I'd call "prompt
> decoration". There are standard things you can specify in a prompt to
> change the brevity and tone of a response. LLMs tend to generate
> purple prose, summarize their responses, apologize or warn
> excessively, etc. We can offer a mechanism to quick-add templates to
> the prompt to stem these behaviors, or encourage other ones.

Agreed. I have a prompting system in my ekg project, and it allows
transcluding other prompts, which is quite useful. So you might want to
just include information about your health, or your project, or even more
dynamic things like the date, the org agenda for the day, or whatever.
This is powerful and a good thing to include in this future prompt
management system.

>
> Karthik