* Re: LLM Experiments, Part 1: Corrections
[not found] <m2il3mj961.fsf@gmail.com>
@ 2024-01-22 18:50 ` Sergey Kostyaev
2024-01-22 20:31 ` Andrew Hyatt
2024-01-23 1:36 ` João Távora
` (3 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: Sergey Kostyaev @ 2024-01-22 18:50 UTC (permalink / raw)
To: Andrew Hyatt; +Cc: emacs-devel
[-- Attachment #1: Type: text/plain, Size: 8358 bytes --]
Hello everyone,
This is a cool idea; I will definitely use it in ellama. But I have some suggestions:
1. Every prompt should be customizable. And since llm is a low-level library, it should be customizable at function call time (so custom variables can be managed on the caller's side); otherwise it will be easier to just reimplement this functionality. (A rough sketch of what I mean follows after these suggestions.)
2. Maybe it would also be useful to make corrections another way (not instead of the current solution, but alongside it): the user presses a keybinding, changes the prompt or other parameters, and redoes the query. Follow-up revision is also useful, so don't remove it.
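For point 1, something like this is what I have in mind — only a minimal sketch; `my-correct-region', `my-default-correction-prompt' and `my-llm-request' are made-up names, not part of llm.el:

(defcustom my-default-correction-prompt
  "Correct the spelling and grammar of the following text:"
  "Default instruction sent to the LLM when correcting text."
  :type 'string
  :group 'applications)

(defun my-correct-region (start end &optional prompt)
  "Ask the LLM to correct the region between START and END.
When PROMPT is non-nil it overrides `my-default-correction-prompt',
so a caller such as ellama can pass its own instructions."
  (interactive "r")
  (let ((prompt (or prompt my-default-correction-prompt))
        (text (buffer-substring-no-properties start end)))
    ;; `my-llm-request' stands in for whatever llm.el call is used.
    (my-llm-request (concat prompt "\n\n" text))))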
About your questions:
1. I think these should be separate require calls, but which package they end up in doesn't matter to me. Do whatever you are comfortable with.
2. I don't know the fsm library, but I understand how to manage finite state machines. I would prefer simpler code. If it is readable with this library, fine; if it is readable without it, also fine.
3. This should have a small default length (256-1000 tokens or words, something like that) and be extendable by the caller's code. It should differ between scenarios; we need maximum flexibility here.
4. 20 seconds of blocked Emacs is way too long. Some big local models are very good but not very fast; for example, mixtral 8x7b instruct provides great quality but is slow. I prefer not to break the user's flow by blocking. A configurable ability to show the generation stream (or hide it if the user doesn't want it) would be perfect.
5. See https://github.com/karthink/gptel as an example of flexibility.
6. Emacs has great explainability: there are 'M-x' commands and which-key integration for remembering keybindings faster. And we can add other interfaces, for example grouping actions by meaning with completing-read (sketched below).
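For point 6, I imagine a single entry command along these lines (a sketch only; the action names and `my-llm-run-action' are invented):

(defvar my-llm-actions
  '(("rewrite: correct grammar" . correct)
    ("rewrite: shorten"         . shorten)
    ("translate: to English"    . translate)
    ("explain: summarize"       . summarize))
  "Alist mapping user-visible action names to action symbols.")

(defun my-llm-act-on-region (start end)
  "Pick an LLM action by name and apply it to the region START..END."
  (interactive "r")
  (let* ((name (completing-read "LLM action: " my-llm-actions nil t))
         (action (cdr (assoc name my-llm-actions))))
    ;; `my-llm-run-action' is a placeholder for the actual dispatch.
    (my-llm-run-action action start end)))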
Best regards,
Sergey Kostyaev
> On 22 Jan 2024, at 11:15, Andrew Hyatt <ahyatt@gmail.com> wrote:
>
>
> Hi everyone,
>
> This email is a demo and a summary of some questions which could use your feedback in the context of using LLMs in Emacs, and specifically the development of the llm GNU ELPA package. If that interests you, read on.
>
> I'm starting to experiment with what LLMs and Emacs, together, are capable of. I've written the llm package to act as a base layer, allowing communication with various LLMs: servers, local LLMs, free, and nonfree. ellama, also a GNU ELPA package, is showing some interesting functionality - asking about a region, translating a region, adding code, getting a code review, etc.
>
> My goal is to take that basic approach that ellama is doing (providing useful functionality beyond chat that only the LLM can give), and expand it to a new set of more complicated interactions. Each new interaction is a new demo, and as I write them, I'll continue to develop a library that can support these more complicated experiences. The demos should be interesting, and more importantly, developing them brings up interesting questions that this mailing list may have some opinions on.
>
> To start, I have a demo showing the user using an LLM to rewrite existing text.
>
> <rewrite-demo.gif>
> I've created a function that will ask for a rewrite of the current region. The LLM offers a suggestion, which the user can review with ediff, and ask for a revision. This can continue until the user is satisfied, and then the user can accept the rewrite, which will replace the region.
>
> You can see the version of code in a branch of my llm source here:
> https://raw.githubusercontent.com/ahyatt/llm/flows/llm-flows.el
>
> And you can see the code that uses it to write the text corrector function here:
> https://gist.githubusercontent.com/ahyatt/63d0302c007223eaf478b84e64bfd2cc/raw/c1b89d001fcbe948cf563d5ee2eeff00976175d4/llm-flows-example.el
>
> There are a few questions I'm trying to figure out in all these demos, so let me state them and give my current guesses. These are things I'd love feedback on.
>
> Question 1: Does the llm-flows.el file really belong in the llm package? It does help people code against llms, but it expands the scope of the llm package from being just about connecting to different LLMs to offering a higher level layer necessary for these more complicated flows. I think this probably does make sense; there's no need to have a separate package just for this one part.
>
> Question 2: What's the best way to write these flows with multiple stages, in which some stages sometimes need to be repeated? It's kind of a state machine when you think about it, and there's a state machine GNU ELPA library already (fsm). I opted to not model it explicitly as a state machine, optimizing instead to just use the most straightforward code possible.
>
> Question 3: How should we deal with context? The code that has the text corrector doesn't include surrounding context (the text before and after the text to rewrite), but it usually is helpful. How much context should we add? The llm package does know about model token limits, but more tokens add more cost in terms of actual money (per-token billing for services, or just the CPU energy costs for local models). Having it be customizable makes sense to some extent, but users are not expected to have a good sense of how much context to include. My guess is that we should just have a small amount of context that won't be a problem for most models. But there are other questions as well when you think about context generally: How would context work in different modes? What about when context may spread across multiple files? It's a problem that I don't have any good insight into yet.
>
> Question 4: Should the LLM calls be synchronous? In general, it's not great to block all of Emacs on a sync call to the LLM. On the other hand, the LLM calls are generally fast enough (a few seconds, the current timeout is 20s) that the user isn't going to be accomplishing much while the LLM works, and is likely to get into a state where the workflow is waiting for their input and we have to get them back to a state where they are interacting with the workflow. Streaming calls are a way that works well for just getting a response from the LLM, but when we have a workflow, the response isn't useful until it is processed (in the demo's case, until it is an input into ediff-buffers). I think things have to be synchronous here.
>
> Question 5: Should there be a standard set of user behaviors about editing the prompt? In another demo (one I'll send as a followup), with a universal argument, the user can edit the prompt, minus context and content (in this case the content is the text to correct). Maybe that should always be the case. However, that prompt can be long, perhaps a bit long for the minibuffer. Using a buffer instead seems like it would complicate the flow. Also, if the context and content is embedded in that prompt, they would have to be replaced with some placeholder. I think the prompt should always be editable; we should have some templating system. Perhaps Emacs already has some templating system, and one that can pass arguments for number of tokens from context would be nice.
>
> Question 6: How do we avoid having a ton of very specific functions for all the various ways that LLMs can be used? Besides correcting text, I could have had it expand it, summarize it, translate it, etc. Ellama offers all these things (but without the diff and other workflow-y aspects). I think these are too much for the user to remember. It'd be nice to have one function when the user wants to do something, and we work out what to do in the workflow. But the user shouldn't be developing the prompt themselves; at least at this point, it's kind of hard to just think of everything you need to think of in a good prompt. They need to be developed, updated, etc. What might be good is a system in which the user chooses what they want to do to a region as a secondary input, kind of like another kind of execute-extended-command.
>
> These are the issues as I see them now. As I continue to develop demos, and as people in the list give feedback, I'll try to work through them.
>
> BTW, I plan on continuing these emails, one for every demo, until the questions seem worked out. If this mailing list is not the appropriate place for this, let me know.
[-- Attachment #2: Type: text/html, Size: 8999 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-22 18:50 ` LLM Experiments, Part 1: Corrections Sergey Kostyaev
@ 2024-01-22 20:31 ` Andrew Hyatt
2024-01-22 22:06 ` T.V Raman
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-22 20:31 UTC (permalink / raw)
To: Sergey Kostyaev; +Cc: emacs-devel
On 23 January 2024 01:50, Sergey Kostyaev <sskostyaev@gmail.com>
wrote:
Hello everyone, This is cool Idea, I will definitely use it in
ellama. But I have some suggestions: 1. Every prompt should be
customizable. And since llm is low level library, it should be
customizable at function call time (to manage custom variables
on caller’s side). Or easier will be to reimplement this
functionality. 2. Maybe it will be useful to make corrections
other way (not instead of current solution, but together with
it): user press some keybinding and change prompt or other
parameters and redo query. Follow up revision also useful, so
don’t remove it. About your questions: 1. I think it should
be different require calls, but what package it will be -
doesn’t matter for me. Do it anyhow you will be comfortable
with. 2. I don’t know fsm library, but understand how to
manage finite state machines. I would prefer simpler code. If
it will be readable with this library - ok, if without - also
fine.
Agreed with all of the above. Seems worth trying out fsm, but not sure
how much it will help.
3. This should have small default length (256 - 1000 tokens or
words or something like that) and be extendable by caller’s
code. This should be different in different scenarios. Need
maximum flexibility here.
Agreed, probably a small default length is sufficient - but it
might be good to have options for maximizing the length. The
extensibility here may be tricky to design, but it's important.
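To make the idea concrete, here's a rough sketch of a small default with room for the caller to override it; none of these names are from the actual llm-flows.el:

(defvar my-context-chars 1000
  "Approximate number of characters of surrounding text sent as context.")

(defun my-surrounding-context (start end)
  "Return (BEFORE . AFTER) text surrounding the region START..END,
limited to roughly `my-context-chars' characters in total."
  (let ((before (max (point-min) (- start (/ my-context-chars 2))))
        (after  (min (point-max) (+ end (/ my-context-chars 2)))))
    (cons (buffer-substring-no-properties before start)
          (buffer-substring-no-properties end after))))

A caller could let-bind `my-context-chars', or the whole function could be a variable the caller replaces.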
4. 20 seconds of blocked Emacs is way too long. Some big local
models are very good, but not very fast. For example mixtral
8x7b instruct provides great quality, but not very fast. I
prefer not break user’s flow by blocking. I think configurable
ability to show generation stream (or don’t show if user don’t
want it) will be perfect.
How do you see this working in the demo I shared, though?
Streaming wouldn't help at all, AFAICT. If you don't block, how
does the user get to the ediff screen? Does it just pop up in the
middle of whatever they were doing? That seems intrusive. It would
be better to message the user that they can do something to get
back into the workflow. Still, at least for me, I'd prefer to just
wait. I'm doing something that I'm turning my attention to, so
even if it takes a while, I want to maintain my focus on that
task. At least as long as I don't get bored, but LLMs are fast
enough that I'm not losing focus here.
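For completeness, the non-blocking version I can imagine would look roughly like this; `my-llm-chat-async' stands in for the async entry point, and `my-rewrite-review' is a hypothetical command that starts the ediff:

(defun my-rewrite-region-async (start end prompt)
  "Send the region START..END to the LLM and notify when the rewrite is ready."
  (let ((text (buffer-substring-no-properties start end)))
    (my-llm-chat-async
     (concat prompt "\n\n" text)
     (lambda (response)
       (with-current-buffer (get-buffer-create "*llm rewrite*")
         (erase-buffer)
         (insert response))
       ;; Don't steal focus; just tell the user how to get back into the flow.
       (message "LLM rewrite ready: M-x my-rewrite-review to ediff it")))))

But as I said above, for this particular flow I'd still rather just wait.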
5. See https://github.com/karthink/gptel as an example of
flexibility.
Agreed, it's a very full system for prompt editing.
6. Emacs has great explainability. There are ‘M-x’ commands, which-key integration for faster remembering keybindings. And we can
add other interface (for example, grouping actions by meaning with completing-read, for example).
Best regards,
Sergey Kostyaev
On 22 Jan 2024, at 11:15, Andrew Hyatt <ahyatt@gmail.com> wrote:
Hi everyone,
This email is a demo and a summary of some questions which could use your feedback in the context of using LLMs in Emacs, and
specifically the development of the llm GNU ELPA package. If that interests you, read on.
I'm starting to experiment with what LLMs and Emacs, together, are capable of. I've written the llm package to act as a base layer,
allowing communication various LLMs: servers, local LLMs, free, and nonfree. ellama, also a GNU ELPA package, is also
showing some interesting functionality - asking about a region, translating a region, adding code, getting a code review, etc.
My goal is to take that basic approach that ellama is doing (providing useful functionality beyond chat that only the LLM can give),
and expand it to a new set of more complicated interactions. Each new interaction is a new demo, and as I write them, I'll continue
to develop a library that can support these more complicated experiences. The demos should be interesting, and more importantly,
developing them brings up interesting questions that this mailing list may have some opinions on.
To start, I have a demo of showing the user using an LLM to rewrite existing text.
<rewrite-demo.gif>
I've created a function that will ask for a rewrite of the current region. The LLM offers a suggestion, which the user can review with
ediff, and ask for a revision. This can continue until the user is satisfied, and then the user can accept the rewrite, which will replace
the region.
You can see the version of code in a branch of my llm source here:
https://raw.githubusercontent.com/ahyatt/llm/flows/llm-flows.el
And you can see the code that uses it to write the text corrector function here:
https://gist.githubusercontent.com/ahyatt/63d0302c007223eaf478b84e64bfd2cc/raw/c1b89d001fcbe948cf563d5ee2eeff00976175d4/llm-flows-example.el
There's a few questions I'm trying to figure out in all these demos, so let me state them and give my current guesses. These are
things I'd love feedback on.
Question 1: Does the llm-flows.el file really belong in the llm package? It does help people code against llms, but it expands the
scope of the llm package from being just about connecting to different LLMs to offering a higher level layer necessary for these more
complicated flows. I think this probably does make sense, there's no need to have a separate package just for this one part.
Question 2: What's the best way to write these flows with multiple stages, in which some stages sometimes need to be repeated? It's
kind of a state machine when you think about it, and there's a state machine GNU ELPA library already (fsm). I opted to not model
it explicitly as a state machine, optimizing instead to just use the most straightforward code possible.
Question 3: How should we deal with context? The code that has the text corrector doesn't include surrounding context (the text
before and after the text to rewrite), but it usually is helpful. How much context should we add? The llm package does know about
model token limits, but more tokens add more cost in terms of actual money (per/token billing for services, or just the CPU energy
costs for local models). Having it be customizable makes sense to some extent, but users are not expected to have a good sense of
how much context to include. My guess is that we should just have a small amount of context that won't be a problem for most
models. But there's other questions as well when you think about context generally: How would context work in different modes?
What about when context may spread in multiple files? It's a problem that I don't have any good insight into yet.
Question 4: Should the LLM calls be synchronous? In general, it's not great to block all of Emacs on a sync call to the LLM. On the
other hand, the LLM calls are generally fast enough (a few seconds, the current timeout is 20s) that the user isn't going to be
accomplishing much while the LLM works, and is likely to get into a state where the workflow is waiting for their input and we
have to get them back to a state where they are interacting with the workflow. Streaming calls are a way that works well for just
getting a response from the LLM, but when we have a workflow, the response isn't useful until it is processed (in the demo's case,
until it is an input into ediff-buffers). I think things have to be synchronous here.
Question 5: Should there be a standard set of user behaviors about editing the prompt? In another demo (one I'll send as a
followup), with a universal argument, the user can edit the prompt, minus context and content (in this case the content is the text to
correct). Maybe that should always be the case. However, that prompt can be long, perhaps a bit long for the minibuffer. Using a
buffer instead seems like it would complicate the flow. Also, if the context and content is embedded in that prompt, they would have
to be replaced with some placeholder. I think the prompt should always be editable, we should have some templating system.
Perhaps emacs already has some templating system, and one that can pass arguments for number of tokens from context would be
nice.
Question 6: How do we avoid having a ton of very specific functions for all the various ways that LLMs can be used? Besides
correcting text, I could have had it expand it, summarize it, translate it, etc. Ellama offers all these things (but without the diff and
other workflow-y aspects). I think these are too much for the user to remember. It'd be nice to have one function when the user wants
to do something, and we work out what to do in the workflow. But the user shouldn't be developing the prompt themselves; at least
at this point, it's kind of hard to just think of everything you need to think of in a good prompt. They need to be developed, updated,
etc. What might be good is a system in which the user chooses what they want to do to a region as a secondary input, kind of like
another kind of execute-extended-command.
These are the issues as I see them now. As I continue to develop demos, and as people in the list give feedback, I'll try to work
through them.
BTW, I plan on continuing these emails, one for every demo, until the questions seem worked out. If this mailing list is not the
appropriate place for this, let me know.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-22 20:31 ` Andrew Hyatt
@ 2024-01-22 22:06 ` T.V Raman
2024-01-23 0:52 ` Andrew Hyatt
0 siblings, 1 reply; 22+ messages in thread
From: T.V Raman @ 2024-01-22 22:06 UTC (permalink / raw)
To: Andrew Hyatt; +Cc: Sergey Kostyaev, emacs-devel
Some more related thoughts below, mostly thinking aloud:
1. From using gptel and ellama against the same model, I see different
style responses, and that kind of inconsistency would be good to get
a handle on; LLMs are difficult enough to figure out re what they're
doing without this additional variation.
2. Package LLM has the laudable goal of bridging between models and
front-ends, and this is going to be vital.
3. (1,2) above lead to the following question:
4. Can we write down a list of common configuration vars --- here
common across the model axis? Make it a union of all such params.
5. Next, write down a list of all configurable params on the UI side.
6. When stable, define a single data-structure in elisp that acts as
the bridge between the front-end emacs UI and the LLM module.
7. Finally factor out the settings of that structure and make it
possible to create "profiles" so that one can predictably experiment
across front-ends and models.
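To make (6) and (7) concrete, something like the following -- just thinking aloud, all names invented:

(require 'cl-lib)

(cl-defstruct my-llm-profile
  name          ; human-readable profile name
  provider      ; llm.el provider object
  temperature   ; model-side parameter
  max-tokens    ; model-side parameter
  show-stream-p ; UI-side parameter: stream output as it arrives?
  review-style) ; UI-side parameter: 'ediff, 'diff, ...

(defvar my-llm-profiles
  (list (make-my-llm-profile :name "fast-local"
                             :temperature 0.2
                             :max-tokens 512
                             :show-stream-p t
                             :review-style 'diff))
  "Named profiles for experimenting across front-ends and models.")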
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-22 22:06 ` T.V Raman
@ 2024-01-23 0:52 ` Andrew Hyatt
2024-01-23 1:57 ` T.V Raman
2024-01-23 3:00 ` Emanuel Berg
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-23 0:52 UTC (permalink / raw)
To: T.V Raman; +Cc: Sergey Kostyaev, emacs-devel
On 22 January 2024 14:06, "T.V Raman" <raman@google.com> wrote:
Some more related thoughts below, mostly thinking aloud: 1.
From using gptel and ellama against the same model, I see
different
style responses, and that kind of inconsistency would be
good to get a handle on; LLMs are difficult enough to
figure out re what they're doing without this additional
variation.
Is this keeping the prompt and temperature constant? There's
inconsistency, though, even keeping everything constant due to the
randomness of the LLM. I often get very different results; for
example, to make the demo I shared, I had to run it about 5 times
because it would either do things too well (no need to demo
corrections), or not well enough (for example, it wouldn't follow
my orders to put everything in one paragraph).
2. Package LLM has the laudible goal of bridgeing between
models and
front-ends, and this is going to be vital.
3. (1,2) above lead to the following question: 4. Can we
write down a list of common configuration vars --- here
common across the model axis. Make it a union of all such
params.
I think the list of common model-and-prompt configuration options is
already in the llm package, but we will probably need to keep
expanding it.
5. Next, write down a list of all configurable params on the
UI side.
This will change quite a bit depending on the task. It's unclear
how much should be configurable - for example, in the demo, I have
ediff so the user can see and evaluate the diff. But maybe that
should be configurable, so if the user wants to see just a diff
output instead, perhaps that should be allowed? When I was
thinking about a state machine, I was thinking that parts of the
state machine might be overridable by the user; for example, "have
the user check the results of the operation" could be a state in the
state machine that the user can just define their own function
for. I suspect we'll have a better idea of this after a few more
demos.
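To sketch what I mean by an overridable step (the names here are hypothetical, not part of llm-flows.el):

(defvar my-review-function #'my-review-with-ediff
  "Function called with ORIG-BUF and NEW-BUF so the user can review a result.")

(defun my-review-with-ediff (orig-buf new-buf)
  "Default review step: compare the two buffers with ediff."
  (ediff-buffers orig-buf new-buf))

;; A user who prefers a plain diff could instead set:
;; (setq my-review-function (lambda (a b) (diff a b nil 'no-async)))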
6. When stable, define a single data-structure in elisp that
acts as
the bridge between the front-end emacs UI and the LLM
module.
If I understand you correctly, this would be the configuration you
listed in your points (4) and (5)?
7. Finally factor out the settings of that structure and make
it
possible to create "profiles" so that one can predictably
experiment across front-ends and models.
I like this idea, thanks!
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-23 0:52 ` Andrew Hyatt
@ 2024-01-23 1:57 ` T.V Raman
2024-01-23 3:00 ` Emanuel Berg
1 sibling, 0 replies; 22+ messages in thread
From: T.V Raman @ 2024-01-23 1:57 UTC (permalink / raw)
To: ahyatt; +Cc: raman, sskostyaev, emacs-devel
I'm seeing short responses with gptel that are good; roundabout
responses for the same question with ellama.
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-23 0:52 ` Andrew Hyatt
2024-01-23 1:57 ` T.V Raman
@ 2024-01-23 3:00 ` Emanuel Berg
2024-01-23 3:49 ` Andrew Hyatt
1 sibling, 1 reply; 22+ messages in thread
From: Emanuel Berg @ 2024-01-23 3:00 UTC (permalink / raw)
To: emacs-devel
Andrew Hyatt wrote:
> [...] 1. From using gptel and ellama against the same
> model, I see different style responses, and that
> kind of inconsistency would be good to get a handle on;
> LLMs are difficult enough to figure out re what they're
> doing without this additional variation.
>
> Is this keeping the prompt and temperature constant? There's
> inconsistency, though, even keeping everything constant due to
> the randomness of the LLM. I often get very different
> results, for example, to make the demo I shared, I had to run
> it like 5 times because it would either do things too well (no
> need to demo corrections), or not well enough (for example, it
> wouldn't follow my orders to put everything in one paragraph).
>
> 2. Package LLM has the laudible goal of bridgeing between
> models and front-ends, and this is going to be
> vital. 3. (1,2) above lead to the following question:
> 4. Can we write down a list of common configuration
> vars --- here common across the model axis. Make it
> a union of all such params. [...]
Uhm, pardon me for asking but why are the e-mails looking
like this?
--
underground experts united
https://dataswamp.org/~incal
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-23 3:00 ` Emanuel Berg
@ 2024-01-23 3:49 ` Andrew Hyatt
0 siblings, 0 replies; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-23 3:49 UTC (permalink / raw)
To: emacs-devel
[-- Attachment #1: Type: text/plain, Size: 1701 bytes --]
Thanks for pointing this out - I was using gnus to respond to email, it
looks like it messed things up for reasons probably having to do with
quoting. I don't think I've configured anything strange here, but who
knows. For now, I'll just use gmail to respond.
On Mon, Jan 22, 2024 at 11:11 PM Emanuel Berg <incal@dataswamp.org> wrote:
> Andrew Hyatt wrote:
>
> > [...] 1. From using gptel and ellama against the same
> > model, I see different style responses, and that
> > kind of inconsistency would be good to get a handle on;
> > LLMs are difficult enough to figure out re what they're
> > doing without this additional variation.
> >
> > Is this keeping the prompt and temperature constant? There's
> > inconsistency, though, even keeping everything constant due to
> > the randomness of the LLM. I often get very different
> > results, for example, to make the demo I shared, I had to run
> > it like 5 times because it would either do things too well (no
> > need to demo corrections), or not well enough (for example, it
> > wouldn't follow my orders to put everything in one paragraph).
> >
> > 2. Package LLM has the laudible goal of bridgeing between
> > models and front-ends, and this is going to be
> > vital. 3. (1,2) above lead to the following question:
> > 4. Can we write down a list of common configuration
> > vars --- here common across the model axis. Make it
> > a union of all such params. [...]
>
> Uhm, pardon me for asking but why are the e-mails looking
> like this?
>
> --
> underground experts united
> https://dataswamp.org/~incal
>
>
>
[-- Attachment #2: Type: text/html, Size: 2258 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
[not found] <m2il3mj961.fsf@gmail.com>
2024-01-22 18:50 ` LLM Experiments, Part 1: Corrections Sergey Kostyaev
@ 2024-01-23 1:36 ` João Távora
2024-01-23 4:17 ` T.V Raman
2024-01-23 19:19 ` Andrew Hyatt
2024-01-24 1:26 ` contact
` (2 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: João Távora @ 2024-01-23 1:36 UTC (permalink / raw)
To: Andrew Hyatt; +Cc: emacs-devel, sskostyaev
On Mon, Jan 22, 2024 at 4:16 AM Andrew Hyatt <ahyatt@gmail.com> wrote:
>
>
> Hi everyone,
Hi Andrew,
I have some ideas to share, though keep in mind this is mainly
thinking out loud and I'm largely an LLM newbie.
> Question 1: Does the llm-flows.el file really belong in the llm
> package?
Maybe, but keep the functions isolated. I'd be interested in
a diff-mode flow, which is different from the ediff one you
demo. So it should be possible to build both.
The diff-mode flow I'm thinking of would be similar to the
diff option for LSP-proposed edits to your code, btw. See the
variable eglot-confirm-server-edits for an idea of the interface.
> Question 3: How should we deal with context? The code that has the
> text corrector doesn't include surrounding context (the text
> before and after the text to rewrite), but it usually is helpful.
> How much context should we add?
Karthik of gptel.el explained to me that this is one of
the biggest challenges of working with LLMs, and that GitHub
Copilot and other code-assistance tools work by sending
not only the region you're interested in having the LLM help you
with but also some auxiliary functions and context discovered
heuristically. This is potentially complex, and likely doesn't
belong in your base llm.el, but it should be possible to do
somehow with an application built on top of llm.el (Karthik
suggests tree-sitter or LSP's reference finding abilities to
discover what's nearest in terms of context).
In case no one mentioned this already, I think a good logging
facility is essential. This could go in the base llm.el library.
I'm obviously biased towards my own jsonrpc.el logging facilities,
where a separate easy-to-find buffer for each JSON-RPC connection
lists all the JSON transport-level conversation details in a
consistent format. jsonrpc.el clients can also use those logging
facilities to output application-level details.
In an LLM library, I suppose the equivalent to JSON transport-level
details are the specific API calls to each provider, how it gathers
context, prompts, etc. Those would be distinct for each LLM.
A provider-agnostic application built on top of llm.el's abstraction
could log in a much more consistent way.
So my main point regarding logging is that it should live in a
readable log buffer, so it's easy to piece together what happened
and debug. Representing JSON as pretty-printed plists is often
very practical in my experience (though a bit slow if loads of text
is to be printed).
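A tiny sketch of the kind of thing I mean (just an illustration, not jsonrpc.el's actual code):

(require 'pp)

(defun my-llm-log (provider kind plist)
  "Append an event of KIND for PROVIDER to the *llm log* buffer.
PLIST holds the request or response details."
  (with-current-buffer (get-buffer-create "*llm log*")
    (goto-char (point-max))
    (insert (format ";; %s  %s  %s\n"
                    (format-time-string "%F %T") provider kind)
            (pp-to-string plist)
            "\n")))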
Maybe these logging transcripts could even be used to produce
automated tests, in case there's a way to achieve any kind of
determinism with LLMs (not sure if there is).
Similarly to logging, it would be good to have some kind
of visual feedback of what context is being sent in each
LLM request. Like momentarily highlighting the regions
to be sent alongside the prompt. Sometimes that is
not feasible. So it could make sense to summarize that extra
context in a few lines shown in the minibuffer perhaps. Like
"lines 2..10 from foo.cpp\nlines42-420 from bar.cpp"
So just my 200c,
Good luck,
João
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-23 1:36 ` João Távora
@ 2024-01-23 4:17 ` T.V Raman
2024-01-23 19:19 ` Andrew Hyatt
1 sibling, 0 replies; 22+ messages in thread
From: T.V Raman @ 2024-01-23 4:17 UTC (permalink / raw)
To: João Távora; +Cc: Andrew Hyatt, emacs-devel, sskostyaev
These are good observations.
Since pretty much all the LLM APIs take JSON, logging the JSON is the
most friction-free approach. This consistency will also help us monitor
the LLM traffic to ensure that rogue clients don't leak
context one doesn't want leaked.
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-23 1:36 ` João Távora
2024-01-23 4:17 ` T.V Raman
@ 2024-01-23 19:19 ` Andrew Hyatt
1 sibling, 0 replies; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-23 19:19 UTC (permalink / raw)
To: João Távora; +Cc: emacs-devel, sskostyaev
[-- Attachment #1: Type: text/plain, Size: 4823 bytes --]
On Mon, Jan 22, 2024 at 9:36 PM João Távora <joaotavora@gmail.com> wrote:
> On Mon, Jan 22, 2024 at 4:16 AM Andrew Hyatt <ahyatt@gmail.com> wrote:
> >
> >
> > Hi everyone,
>
> Hi Andrew,
>
> I have some ideas to share, though keep in mind this is mainly
> thinking out loud and I'm largely an LLM newbie.
>
> > Question 1: Does the llm-flows.el file really belong in the llm
> > package?
>
> Maybe, but keep the functions isolated. I'd be interested in
> a diff-mode flow which is different from this ediff-one you
> demo. So it should be possible to build both.
>
> The diff-mode flow I'm thinking of would be similar to the
> diff option of LSP-proposed edits if your code, btw. See the
> variable eglot-confirm-server-edits for an idea of the interface.
>
Great call-out, thanks. I'll check it out. I'm starting to get a better
idea of how this might all work out in my mind.
>
> > Question 3: How should we deal with context? The code that has the
> > text corrector doesn't include surrounding context (the text
> > before and after the text to rewrite), but it usually is helpful.
> > How much context should we add?
>
> Karthik of gptel.el explained to me that this is one of
> the biggest challenges of working with LLMs, and that GitHub
> Copilot and other code-assistance tools work by sending
> not only the region you're interested in having the LLM help you
> with but also some auxiliary functions and context discovered
> heuristically. This is potentially complex, and likely doesn't
> belong in the your base llm.el but it should be possible to do
> somehow with an application build on top of llm.el (Karthik
> suggests tree-sitter or LSP's reference finding abilities to
> discover what's nearest in terms of context).
>
Interesting idea - yes, this should be customizable, and it will be quite
complicated in some cases.
>
> In case noone mentinoed this already, i think a good logging
> facility is essential. This could go in the base llm.el library.
> I'm obviously biased towards my own jsonrpc.el logging facilities,
> where a separate easy-to-find buffer for each JSON-RPC connection
> lists all the JSON transport-level conversation details in a
> consistent format. jsonrpc.el clients can also use those logging
> facilities to output application-level details.
>
> In an LLM library, I suppose the equivalent to JSON transport-level
> details are the specific API calls to each provider, how it gathers
> context, prompts, etc. Those would be distinct for each LLM.
> A provider-agnosntic application built on top of llm.el's abstraction
> could log in a much more consistent way.
>
> So my main point regarding logging is that is should live in a
> readable log buffer, so it's easy to piece together what happened
> and debug. Representing JSON as pretty-printed plists is often
> very practical in my experience (though a bit slow if loads of text
> is to be printed).
>
Good feedback, mainly for what I've already released in the llm package so
far. JSON is useful for the initial request, but there's a lot of
streaming that happens, which isn't really valid JSON, although sometimes it
contains valid JSON. Sometimes it is just JSON, streaming a chunk at a
time. So in general you have to deal with that stuff as just pure text.
So far, setting url-debug to non-nil is sufficient for basic debugging, but
making a more standard and better logging facility would be very nice. I'll
work on it.
>
> Maybe these logging transcripts could even be used to produce
> automated tests, in case there's a way to achieve any kind of
> determinism with LLMs (not sure if there is).
>
Probably not, unfortunately; even if you can remove the randomness, little
changes are always happening with newer versions of the model or the
processing around it. I have included a fake LLM that can be used to
test whatever flow, though.
>
> Similarly to logging, it would be good to have some kind
> of visual feedback of what context is being sent in each
> LLM request. Like momentarily highlighting the regions
> to be sent alongside the prompt. Sometimes that is
> not feasible. So it could make sense to summarize that extra
> context in a few lines shown in the minibuffer perhaps. Like
> "lines 2..10 from foo.cpp\nlines42-420 from bar.cpp"
>
I like the idea of logging the context. I think it might make sense to
just add that to the debug buffer instead of the minibuffer, though.
Hopefully things just work, and so the minibuffer would just show something
like it does in the demo, just saying it's sending things to whatever LLM
(perhaps a reference to the debug buffer would be nice, though).
>
> So just my 200c,
> Good luck,
> João
>
[-- Attachment #2: Type: text/html, Size: 6244 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
[not found] <m2il3mj961.fsf@gmail.com>
2024-01-22 18:50 ` LLM Experiments, Part 1: Corrections Sergey Kostyaev
2024-01-23 1:36 ` João Távora
@ 2024-01-24 1:26 ` contact
2024-01-24 4:17 ` T.V Raman
2024-01-24 14:55 ` Andrew Hyatt
2024-01-24 2:28 ` Karthik Chikmagalur
2024-05-20 17:28 ` Juri Linkov
4 siblings, 2 replies; 22+ messages in thread
From: contact @ 2024-01-24 1:26 UTC (permalink / raw)
To: Andrew Hyatt, emacs-devel; +Cc: sskostyaev
Hi Andrew,
Having worked on similar problems in gptel for about nine months now
(without much success), here are some thoughts.
> Question 1: Does the llm-flows.el file really belong in the llm
> package? It does help people code against llms, but it expands
> the scope of the llm package from being just about connecting to
> different LLMs to offering a higher level layer necessary for
> these more complicated flows. I think this probably does make
> sense, there's no need to have a separate package just for this
> one part.
I think llm-flows works better as an independent package for now,
since it's not clear yet what the "primitive" operations of working
with LLMs look like. I suspect these usage patterns will also be a
matter of preference for users, as I've realized from comparing how I
use gptel to the feature requests I get.
> Question 2: What's the best way to write these flows with multiple
> stages, in which some stages sometimes need to be repeated? It's
> kind of a state machine when you think about it, and there's a
> state machine GNU ELPA library already (fsm). I opted to not model
> it explicitly as a state machine, optimizing instead to just use
> the most straightforward code possible.
An FSM might be overkill here. At the same time, I'm not sure that
all possible interactions will fit this multi-step paradigm like
rewriting text does.
> Question 3: How should we deal with context? The code that has the
> text corrector doesn't include surrounding context (the text
> before and after the text to rewrite), but it usually is helpful.
> How much context should we add? The llm package does know about
> model token limits, but more tokens add more cost in terms of
> actual money (per/token billing for services, or just the CPU
> energy costs for local models). Having it be customizable makes
> sense to some extent, but users are not expected to have a good
> sense of how much context to include. My guess is that we should
> just have a small amount of context that won't be a problem for
> most models. But there's other questions as well when you think
> about context generally: How would context work in different
> modes? What about when context may spread in multiple files? It's
> a problem that I don't have any good insight into yet.
I see different questions here:
1. What should be the default amount of context included with
requests?
2. How should this context be determined? (single buffer, across
files etc)
3. How should this be different between modes of usage, and how
should this be communicated unambiguously?
4. Should we track token costs (when applicable) and communicate them
to the user?
Some lessons from gptel, which focuses mostly on a chat interface:
1. Users seem to understand gptel's model intuitively since they
think of it like a chat application, where the context is expected to
be everything in the buffer up to the cursor position. The only
addition is to use the region contents instead when the region is
active. This default works well for more than chat, actually. It's
good enough when rewriting-in-place or for continuing your prose/code.
(A sketch of this default follows after point 4 below.)
2. This is tricky, I don't have any workable ideas yet. In gptel
I've experimented with providing the options "full buffer" and "open
project buffers" in addition to the default, but these are both
overkill, expensive and rarely useful -- they often confuse the LLM
more than they help. Additionally, in Org mode documents I've
experimented with using sparse trees as the context -- this is
inexpensive and can work very well but the document has to be
(semantically) structured a certain way. This becomes obvious after a
couple of sessions, but the behavior has to be learned nevertheless.
3a. For coding projects I think it might be possible to construct a
"sparse tree" with LSP or via treesitter, and send (essentially) an
"API reference" along with smaller chunks of code. This should make
basic copilot-style usage viable. I don't use LSP or treesitter
seriously, so I don't know how to do this.
3b. Communicating this unambiguously to users is a UI design question,
and I can imagine many ways to do it.
4. I think optionally showing the cumulative token count for a
"session" (however defined) makes sense.
> Question 5: Should there be a standard set of user behaviors about
> editing the prompt? In another demo (one I'll send as a followup),
> with a universal argument, the user can edit the prompt, minus
> context and content (in this case the content is the text to
> correct). Maybe that should always be the case. However, that
> prompt can be long, perhaps a bit long for the minibuffer. Using a
> buffer instead seems like it would complicate the flow. Also, if
> the context and content is embedded in that prompt, they would
> have to be replaced with some placeholder. I think the prompt
> should always be editable, we should have some templating system.
> Perhaps emacs already has some templating system, and one that can
> pass arguments for number of tokens from context would be nice.
Another unsolved problem in gptel right now. Here's what it uses
currently:
- prompt: from the minibuffer
- context and content: selected region only
The main problem with including context separate from the content here
is actually not the UI, it's convincing the LLM to consistently
rewrite only the content and use the context as context. Using the
prompt+context as the "system message" works, but not all LLM APIs
provide a "system message" field.
> Question 6: How do we avoid having a ton of very specific
> functions for all the various ways that LLMs can be used? Besides
> correcting text, I could have had it expand it, summarize it,
> translate it, etc. Ellama offers all these things (but without the
> diff and other workflow-y aspects). I think these are too much for
> the user to remember.
Yes, this was the original reason I wrote gptel -- the first few
packages for LLM interaction (only GPT-3.5 back then) wanted to
prescribe the means of interaction via dedicated commands, which I
thought overwhelmed the user while also missing what makes LLMs
different from (say) language checkers like proselint and vale, and
from code refactoring tools.
> It'd be nice to have one function when the
> user wants to do something, and we work out what to do in the
> workflow. But the user shouldn't be developing the prompt
> themselves; at least at this point, it's kind of hard to just
> think of everything you need to think of in a good prompt. They
> need to be developed, updated, etc. What might be good is a system
> in which the user chooses what they want to do to a region as a
> secondary input, kind of like another kind of
> execute-extended-command.
I think having users type out their intention in natural language into
a prompt is fine -- the prompt can then be saved and added to a
persistent collection. We will never be able to cover
(programmatically) even a reasonable fraction of the things the user
might want to do.
The things the user might need help with is what I'd call "prompt
decoration". There are standard things you can specify in a prompt to
change the brevity and tone of a response. LLMs tend to generate
purple prose, summarize their responses, apologize or warn
excessively, etc. We can offer a mechanism to quick-add templates to
the prompt to stem these behaviors, or encourage other ones.
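As a sketch of what such a mechanism could look like (names invented, not gptel's):

(require 'crm)

(defvar my-prompt-decorations
  '(("terse"        . "Answer in at most three sentences.")
    ("no-apologies" . "Do not apologize or add warnings.")
    ("plain-prose"  . "Use plain, direct language; avoid purple prose.")
    ("keep-format"  . "Preserve the original formatting and line breaks."))
  "Alist of decoration names to prompt fragments.")

(defun my-decorate-prompt (prompt)
  "Return PROMPT with user-chosen decorations appended."
  (let ((picks (completing-read-multiple "Decorations: " my-prompt-decorations)))
    (mapconcat #'identity
               (cons prompt
                     (mapcar (lambda (pick)
                               (cdr (assoc pick my-prompt-decorations)))
                             picks))
               "\n")))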
Karthik
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-24 1:26 ` contact
@ 2024-01-24 4:17 ` T.V Raman
2024-01-24 15:00 ` Andrew Hyatt
2024-01-24 14:55 ` Andrew Hyatt
1 sibling, 1 reply; 22+ messages in thread
From: T.V Raman @ 2024-01-24 4:17 UTC (permalink / raw)
To: contact; +Cc: Andrew Hyatt, emacs-devel, sskostyaev
All very good points, Karthik!
Some related thoughts below:
1. I think we should for now treat prose-rewriting vs code-rewriting as
separate flows -- but that said, limit our types of "flows"
to 2. More might emerge over time, but it's too early.
2. Multi-step flows with LLMs are still early -- or feel early to me; I
think that for now, we should just have human-in-the-loop at each
step, but then leverage the power of Emacs to help the user stay
efficient in the human-in-the-loop step, start with simple things
like putting point and mark in the right place, populate Emacs
completions with the right choices etc.
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-24 4:17 ` T.V Raman
@ 2024-01-24 15:00 ` Andrew Hyatt
2024-01-24 15:14 ` T.V Raman
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-24 15:00 UTC (permalink / raw)
To: T.V Raman; +Cc: contact, emacs-devel, sskostyaev
On Tue, Jan 23, 2024 at 08:17 PM "T.V Raman" <raman@google.com> wrote:
> All very good points, Kartik!
>
> Some related thoughts below:
>
> 1. I think we should for now treat prose-rewriting vs code-rewriting as
> separate flows -- but that said, limit our types of "flows"
> to 2. More might emerge over time, but it's too early.
How do you see the code and prose rewriting requiring different UI or processing?
> 2. Multi-step flows with LLMs are still early -- or feel early to me; I
> think that for now, we should just have human-in-the-loop at each
> step, but then leverage the power of Emacs to help the user stay
> efficient in the human-in-the-loop step, start with simple things
> like putting point and mark in the right place, populate Emacs
> completions with the right choices etc.
It can't be at every step, though. Maybe you wouldn't consider this a
step, but in my next demo, one step is to get JSON from the LLM, which
requires parsing out the JSON (which tends to be either the entire
response, or often in a markdown block, or if none of the above, we
retry a certain number of times). But agreed that in general we do
want humans to be in control, especially when things get complicated.
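Roughly, that extraction step looks like this (a sketch, not the actual demo code):

(defun my-extract-json (response)
  "Return the JSON payload parsed from RESPONSE, or signal an error.
Tries the whole response first, then a markdown code block; the
caller is expected to retry on failure."
  (condition-case nil
      (json-parse-string response :object-type 'plist)
    (error
     (if (string-match "```\\(?:json\\)?\n\\(\\(?:.\\|\n\\)*?\\)```" response)
         (json-parse-string (match-string 1 response) :object-type 'plist)
       (error "No parsable JSON in LLM response")))))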
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-24 15:00 ` Andrew Hyatt
@ 2024-01-24 15:14 ` T.V Raman
0 siblings, 0 replies; 22+ messages in thread
From: T.V Raman @ 2024-01-24 15:14 UTC (permalink / raw)
To: ahyatt; +Cc: raman, contact, emacs-devel, sskostyaev
1. code rewrite and prose rewrite just feel very different to me --
starting with simple things like white-space formatting etc.
2. Code rewrites therefore require a different type of mental activity
-- side-by-side diff, whereas prose rewrites are more about whether the
meaning has been preserved -- and that is not conveyed by whitespace as directly.
3. You're likely right about JSON parsing and follow-on steps as being
"atomic" actions in some sense from the perspective of using AI as
a tool, but I still feel it is too early to connect too many steps
into one because it happens to work sometimes at present; it'll
likely both get better and change, so we might end up abstracting
early and perhaps erroneously at this stage. So if you do
pool/group steps -- keep that an implementation detail.
Andrew Hyatt writes:
> On Tue, Jan 23, 2024 at 08:17 PM "T.V Raman" <raman@google.com> wrote:
>
> > All very good points, Kartik!
> >
> > Some related thoughts below:
> >
> > 1. I think we should for now treat prose-rewriting vs code-rewriting as
> > separate flows -- but that said, limit our types of "flows"
> > to 2. More might emerge over time, but it's too early.
>
> How do you see the code and prose rewriting requiring different UI or processing?
>
> > 2. Multi-step flows with LLMs are still early -- or feel early to me; I
> > think that for now, we should just have human-in-the-loop at each
> > step, but then leverage the power of Emacs to help the user stay
> > efficient in the human-in-the-loop step, start with simple things
> > like putting point and mark in the right place, populate Emacs
> > completions with the right choices etc.
>
> It can't be at every step, though. Maybe you wouldn't consider this a
> step, but in my next demo, one step is to get JSON from the LLM, which
> requires parsing out the JSON (which tends to be either the entire
> response, or often in a markdown block, or if none of the above, we
> retry a certain amount of times). But agreed that in general that we do
> want humans to be in control, especially when things get complicated.
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: LLM Experiments, Part 1: Corrections
2024-01-24 1:26 ` contact
2024-01-24 4:17 ` T.V Raman
@ 2024-01-24 14:55 ` Andrew Hyatt
1 sibling, 0 replies; 22+ messages in thread
From: Andrew Hyatt @ 2024-01-24 14:55 UTC (permalink / raw)
To: contact; +Cc: emacs-devel, sskostyaev
Thanks for this super useful response. BTW, I'm going to try again to
use gnus to respond, after making some changes, so apologies if the
formatting goes awry. If it does, I'll re-respond in gmail.
On Tue, Jan 23, 2024 at 05:26 PM contact@karthinks.com wrote:
> Hi Andrew,
>
> Having worked on similar problems in gptel for about nine months now
> (without much success), here are some thoughts.
>
>> Question 1: Does the llm-flows.el file really belong in the llm
>> package? It does help people code against llms, but it expands
>> the scope of the llm package from being just about connecting to
>> different LLMs to offering a higher level layer necessary for
>> these more complicated flows. I think this probably does make
>> sense, there's no need to have a separate package just for this
>> one part.
>
> I think llm-flows works better as an independent package for now,
> since it's not clear yet what the "primitive" operations of working
> with LLMs look like. I suspect these usage patterns will also be a
> matter of preference for users, as I've realized from comparing how I
> use gptel to the feature requests I get.
I agree that the more we assume specific patterns (which indeed seems to
be where things are going), the more it should go into its own package
(or maybe be part of ellama or something). But there may be some
commonality that is just useful regardless of the usage patterns. I
think we'll have to see how this plays out.
>> Question 2: What's the best way to write these flows with multiple
>> stages, in which some stages sometimes need to be repeated? It's
>> kind of a state machine when you think about it, and there's a
>> state machine GNU ELPA library already (fsm). I opted to not model
>> it explicitly as a state machine, optimizing instead to just use
>> the most straightforward code possible.
>
> An FSM might be overkill here. At the same time, I'm not sure that
> all possible interactions will fit this multi-step paradigm like
> rewriting text does.
>
>> Question 3: How should we deal with context? The code that has the
>> text corrector doesn't include surrounding context (the text
>> before and after the text to rewrite), but it usually is helpful.
>> How much context should we add? The llm package does know about
>> model token limits, but more tokens add more cost in terms of
>> actual money (per/token billing for services, or just the CPU
>> energy costs for local models). Having it be customizable makes
>> sense to some extent, but users are not expected to have a good
>> sense of how much context to include. My guess is that we should
>> just have a small amount of context that won't be a problem for
>> most models. But there's other questions as well when you think
>> about context generally: How would context work in different
>> modes? What about when context may spread in multiple files? It's
>> a problem that I don't have any good insight into yet.
>
> I see different questions here:
>
> 1. What should be the default amount of context included with
> requests?
> 2. How should this context be determined? (single buffer, across
> files etc)
> 3. How should this be different between modes of usage, and how
> should this be communicated unambiguously?
> 4. Should we track token costs (when applicable) and communicate them
> to the user?
>
> Some lessons from gptel, which focuses mostly on a chat interface:
>
> 1. Users seem to understand gptel's model intuitively since they
> think of it like a chat application, where the context is expected to
> be everything in the buffer up to the cursor position. The only
> addition is to use the region contents instead when the region is is
> active. This default works well for more than chat, actually. It's
> good enough when rewriting-in-place or for continuing your prose/code.
I think it may still be useful to use the context even when modifying
the region, though.
>
> 2. This is tricky, I don't have any workable ideas yet. In gptel
> I've experimented with providing the options "full buffer" and "open
> project buffers" in addition to the default, but these are both
> overkill, expensive and rarely useful -- they often confuse the LLM
> more than they help. Additionally, in Org mode documents I've
> experimented with using sparse trees as the context -- this is
> inexpensive and can work very well but the document has to be
> (semantically) structured a certain way. This becomes obvious after a
> couple of sessions, but the behavior has to be learned nevertheless.
>
> 3a. For coding projects I think it might be possible to construct a
> "sparse tree" with LSP or via treesitter, and send (essentially) an
> "API reference" along with smaller chunks of code. This should make
> basic copilot-style usage viable. I don't use LSP or treesitter
> seriously, so I don't know how to do this.
>
> 3b. Communicating this unambiguously to users is a UI design question,
> and I can imagine many ways to do it.
Thanks, this is very useful. My next demo is with org-mode, and there
I'm currently sending the tree (just the structure) as context.
There's a meta-question here: we don't know what the right
thing to do is, because we don't have a quality process. If we were
doing this seriously, we'd want a corpus of examples to test with, and a
way to judge quality (LLMs can actually do this as well). The fact that
people could be running any model is a significant complicating factor.
But as it is, we just have our own anecdotal evidence and reasonable
hunches. As these things get better, I think it would converge to what
we would think is reasonable context anyway.
>
> 4. I think optionally showing the cumulative token count for a
> "session" (however defined) makes sense.
The token count should only be applicable to the current LLM call. My
understanding is that the machinery around the LLMs is adding the
previous conversation, in whole, or in summary, as context. So it's
really hard to understand the actual token usage once you start having
conversations. That's fine, though, since most tokens are used in the
first message where context appears, and subsequent rounds are fairly
light.
>
>> Question 5: Should there be a standard set of user behaviors about
>> editing the prompt? In another demo (one I'll send as a followup),
>> with a universal argument, the user can edit the prompt, minus
>> context and content (in this case the content is the text to
>> correct). Maybe that should always be the case. However, that
>> prompt can be long, perhaps a bit long for the minibuffer. Using a
>> buffer instead seems like it would complicate the flow. Also, if
>> the context and content is embedded in that prompt, they would
>> have to be replaced with some placeholder. I think the prompt
>> should always be editable, we should have some templating system.
>> Perhaps emacs already has some templating system, and one that can
>> pass arguments for number of tokens from context would be nice.
>
> Another unsolved problem in gptel right now. Here's what it uses
> currently:
>
> - prompt: from the minibuffer
> - context and content: selected region only
>
> The main problem with including context separate from the content here
> is actually not the UI, it's convincing the LLM to consistently
> rewrite only the content and use the context as context. Using the
> prompt+context as the "system message" works, but not all LLM APIs
> provide a "system message" field.
Yes, the LLM library already separates the "system message" (which we
call context, which I now realize is a bit confusing), and each provider
just deals with it in the best way possible. Hopefully as LLM
instruction following gets better, it will stop treating context as
anything other than context. In the end, IIUC, it all ends up in the
same place when feeding text to the LLM anyway.
>
>> Question 6: How do we avoid having a ton of very specific
>> functions for all the various ways that LLMs can be used? Besides
>> correcting text, I could have had it expand it, summarize it,
>> translate it, etc. Ellama offers all these things (but without the
>> diff and other workflow-y aspects). I think these are too much for
>> the user to remember.
>
> Yes, this was the original reason I wrote gptel -- the first few
> packages for LLM interaction (only GPT-3.5 back then) wanted to
> prescribe the means of interaction via dedicated commands, which I
> thought overwhelmed the user while also missing what makes LLMs
> different from (say) language checkers like proselint and vale, and
> from code refactoring tools.
>
>> It'd be nice to have one function for when the
>> user wants to do something, and we work out what to do in the
>> workflow. But the user shouldn't be developing the prompt
>> themselves; at least at this point, it's kind of hard to just
>> think of everything you need to think of in a good prompt. They
>> need to be developed, updated, etc. What might be good is a system
>> in which the user chooses what they want to do to a region as a
>> secondary input, kind of like another kind of
>> execute-extended-command.
>
> I think having users type out their intention in natural language into
> a prompt is fine -- the prompt can then be saved and added to a
> persistent collection. We will never be able to cover
> (programmatically) even a reasonable fraction of the things the user
> might want to do.
Agreed. Increasingly, it seems like a fairly advanced prompt-management
system might be necessary. How to do that is its own separate set of
questions. How should prompts be stored? As variables? Files? Data
in sqlite?
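One low-tech possibility, just to make the question concrete: a
directory of plain-text templates picked with completing-read. The
names here are hypothetical, and this ignores versioning and sharing
entirely.

  (defvar my/llm-prompt-directory (locate-user-emacs-file "llm-prompts/")
    "Directory holding one prompt template per file.")

  (defun my/llm-choose-prompt ()
    "Pick a saved prompt template by file name and return its contents."
    (let* ((files (directory-files my/llm-prompt-directory nil "\\.txt\\'"))
           (choice (completing-read "Prompt template: " files nil t)))
      (with-temp-buffer
        (insert-file-contents
         (expand-file-name choice my/llm-prompt-directory))
        (buffer-string))))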
>
> The things the user might need help with are what I'd call "prompt
> decoration". There are standard things you can specify in a prompt to
> change the brevity and tone of a response. LLMs tend to generate
> purple prose, summarize their responses, apologize or warn
> excessively, etc. We can offer a mechanism to quick-add templates to
> the prompt to stem these behaviors, or encourage other ones.
Agreed. I have a prompting system in my ekg project, and it allows
transcluding other prompts, which is quite useful. So you might want to
just include information about your health, or your project, or even
more dynamic things like the date, org agenda for the day, or whatever.
This is powerful and a good thing to include in this future prompt
management system.
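As a sketch of the dynamic part, format-spec already gets you most of
the way there; the placeholders below are arbitrary examples of mine,
not ekg's actual syntax.

  (require 'format-spec)

  (defun my/llm-expand-prompt (template)
    "Fill dynamic placeholders in TEMPLATE.
  %d expands to today's date and %b to the current buffer's name."
    (format-spec template
                 `((?d . ,(format-time-string "%F"))
                   (?b . ,(buffer-name)))))

  ;; Example: (my/llm-expand-prompt "Today is %d. Improve the text in %b.")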
>
> Karthik
* Re: LLM Experiments, Part 1: Corrections
[not found] <m2il3mj961.fsf@gmail.com>
` (2 preceding siblings ...)
2024-01-24 1:26 ` contact
@ 2024-01-24 2:28 ` Karthik Chikmagalur
2024-05-20 17:28 ` Juri Linkov
4 siblings, 0 replies; 22+ messages in thread
From: Karthik Chikmagalur @ 2024-01-24 2:28 UTC (permalink / raw)
To: Andrew Hyatt, emacs-devel; +Cc: sskostyaev
Hi Andrew,
Having worked on similar problems in gptel for about nine months now
(without much success), here are some thoughts.
> Question 1: Does the llm-flows.el file really belong in the llm
> package? It does help people code against llms, but it expands
> the scope of the llm package from being just about connecting to
> different LLMs to offering a higher level layer necessary for
> these more complicated flows. I think this probably does make
> sense; there's no need to have a separate package just for this
> one part.
I think llm-flows works better as an independent package for now,
since it's not clear yet what the "primitive" operations of working
with LLMs look like. I suspect these usage patterns will also be a
matter of preference for users, as I've realized from comparing how I
use gptel to the feature requests I get.
> Question 2: What's the best way to write these flows with multiple
> stages, in which some stages sometimes need to be repeated? It's
> kind of a state machine when you think about it, and there's a
> state machine GNU ELPA library already (fsm). I opted to not model
> it explicitly as a state machine, optimizing instead to just use
> the most straightforward code possible.
An FSM might be overkill here. At the same time, I'm not sure that
all possible interactions will fit this multi-step paradigm the way
rewriting text does.
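For the rewriting flow specifically, the straightforward version can be
as simple as a loop. A blocking sketch (assuming llm-chat returns the
response text; a real version would use the async variant and do the
ediff review where the comment is):

  (require 'llm)

  (defun my/llm-rewrite-loop (provider make-prompt)
    "Keep asking PROVIDER for rewrites until the user accepts one.
  MAKE-PROMPT takes the user's latest feedback (or nil) and returns a
  prompt object for the llm library."
    (let (feedback response)
      (catch 'accepted
        (while t
          (setq response (llm-chat provider (funcall make-prompt feedback)))
          ;; This is where the ediff review of RESPONSE would happen.
          (if (y-or-n-p "Accept this rewrite? ")
              (throw 'accepted response)
            (setq feedback (read-string "Revision instructions: ")))))))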
> Question 3: How should we deal with context? The code that has the
> text corrector doesn't include surrounding context (the text
> before and after the text to rewrite), but it usually is helpful.
> How much context should we add? The llm package does know about
> model token limits, but more tokens add more cost in terms of
> actual money (per-token billing for services, or just the CPU
> energy costs for local models). Having it be customizable makes
> sense to some extent, but users are not expected to have a good
> sense of how much context to include. My guess is that we should
> just have a small amount of context that won't be a problem for
> most models. But there are other questions as well when you think
> about context generally: How would context work in different modes?
> What about when context may be spread across multiple files? It's
> a problem that I don't have any good insight into yet.
I see different questions here:
1. What should be the default amount of context included with
requests?
2. How should this context be determined? (single buffer, across
files etc)
3. How should this be different between modes of usage, and how
should this be communicated unambiguously?
4. Should we track token costs (when applicable) and communicate them
to the user?
Some lessons from gptel, which focuses mostly on a chat interface:
1. Users seem to understand gptel's model intuitively, since they
think of it like a chat application, where the context is expected to
be everything in the buffer up to the cursor position. The only
addition is to use the region contents instead when the region is
active (see the sketch after this list). This default works well for
more than chat, actually. It's good enough for rewriting in place or
for continuing your prose/code.
2. This is tricky; I don't have any workable ideas yet. In gptel
I've experimented with providing the options "full buffer" and "open
project buffers" in addition to the default, but these are both
overkill: expensive and rarely useful -- they often confuse the LLM
more than they help. Additionally, in Org mode documents I've
experimented with using sparse trees as the context. This is
inexpensive and can work very well, but the document has to be
(semantically) structured a certain way. That becomes obvious after a
couple of sessions, but the behavior has to be learned nevertheless.
3a. For coding projects I think it might be possible to construct a
"sparse tree" with LSP or via treesitter, and send (essentially) an
"API reference" along with smaller chunks of code. This should make
copilot-style usage viable. I don't use LSP or treesitter seriously,
so I don't know how to do this.
3b. Communicating this unambiguously to users is a UI design question,
and I can imagine many ways to do it.
4. I think optionally showing the cumulative token count for a
"session" (however defined) makes sense.
> Question 5: Should there be a standard set of user behaviors about
> editing the prompt? In another demo (one I'll send as a followup),
> with a universal argument, the user can edit the prompt, minus
> context and content (in this case the content is the text to
> correct). Maybe that should always be the case. However, that
> prompt can be long, perhaps a bit long for the minibuffer. Using a
> buffer instead seems like it would complicate the flow. Also, if
> the context and content are embedded in that prompt, they would
> have to be replaced with some placeholder. I think the prompt
> should always be editable, so we should have some templating system.
> Perhaps Emacs already has a templating system, and one that can
> take an argument for how many tokens of context to include would be
> nice.
Another unsolved problem in gptel right now. Here's what it uses
currently:
- prompt: from the minibuffer
- context and content: selected region only
The main problem with including context separate from the content here
is actually not the UI; it's convincing the LLM to consistently
rewrite only the content and use the context as context.
> Question 6: How do we avoid having a ton of very specific
> functions for all the various ways that LLMs can be used? Besides
> correcting text, I could have had it expand it, summarize it,
> translate it, etc. Ellama offers all these things (but without the
> diff and other workflow-y aspects). I think these are too much for
> the user to remember.
Yes, this was the original reason I wrote gptel -- the first few
packages for LLM interaction (only GPT-3.5 back then) wanted to
prescribe the means of interaction via dedicated commands, which I
thought overwhelmed the user while also missing what makes LLMs
different from (say) language checkers like proselint and vale, and
from code refactoring tools.
> It'd be nice to have one function for when the
> user wants to do something, and we work out what to do in the
> workflow. But the user shouldn't be developing the prompt
> themselves; at least at this point, it's kind of hard to just
> think of everything you need to think of in a good prompt. They
> need to be developed, updated, etc. What might be good is a system
> in which the user chooses what they want to do to a region as a
> secondary input, kind of like another kind of
> execute-extended-command.
I think having users type out their intention in natural language into
a prompt is fine -- the prompt can then be saved and added to a
persistent collection. We will never be able to cover
(programmatically) even a reasonable fraction of the things the user
might want to do.
The things the user might need help with are what I'd call "prompt
decoration". There are standard things you can specify in a prompt to
change the brevity and tone of a response. LLMs tend to generate
purple prose, summarize their responses, apologize or warn
excessively, etc. We can offer a mechanism to quick-add templates to
the prompt to stem these behaviors, or encourage other ones.
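A quick-add mechanism can be as simple as completing-read-multiple over
a small alist of canned directives; the names and wording below are
illustrative only.

  (defvar my/llm-prompt-decorations
    '(("concise"    . "Be concise. Do not restate the question.")
      ("no-apology" . "Do not apologize or add warnings.")
      ("plain"      . "Use plain, direct prose; avoid flowery language."))
    "Example decoration snippets to append to a prompt.")

  (defun my/llm-decorate-prompt (prompt)
    "Let the user append canned decorations to PROMPT."
    (let ((picks (completing-read-multiple
                  "Decorations: " my/llm-prompt-decorations)))
      (concat prompt "\n"
              (mapconcat (lambda (key)
                           (cdr (assoc key my/llm-prompt-decorations)))
                         picks "\n"))))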
Karthik
* Re: LLM Experiments, Part 1: Corrections
[not found] <m2il3mj961.fsf@gmail.com>
` (3 preceding siblings ...)
2024-01-24 2:28 ` Karthik Chikmagalur
@ 2024-05-20 17:28 ` Juri Linkov
4 siblings, 0 replies; 22+ messages in thread
From: Juri Linkov @ 2024-05-20 17:28 UTC (permalink / raw)
To: Andrew Hyatt; +Cc: emacs-devel
> Question 3: How should we deal with context? The code that has the text
> corrector doesn't include surrounding context (the text before and after
> the text to rewrite), but it usually is helpful. How much context should we
> add? The llm package does know about model token limits, but more tokens
> add more cost in terms of actual money (per-token billing for services, or
> just the CPU energy costs for local models). Having it be customizable
> makes sense to some extent, but users are not expected to have a good sense
> of how much context to include. My guess is that we should just have
> a small amount of context that won't be a problem for most models. But
> there are other questions as well when you think about context generally:
> How would context work in different modes? What about when context may be
> spread across multiple files? It's a problem that I don't have any good
> insight into yet.
I suppose you are already familiar with the different methods
Copilot uses to obtain the relevant context:
https://github.blog/2023-05-17-how-github-copilot-is-getting-better-at-understanding-your-code/
https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals.html
etc., where editor tabs correspond to Emacs buffers, so in Emacs the
context could be collected from existing buffers, much as
dabbrev-expand does.
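To make that concrete, here is a rough sketch of gathering context from
other buffers in the same major mode (the function name and limits are
arbitrary, and this does none of the similarity ranking that, as I
understand it, Copilot applies to snippets from neighboring tabs):

  (defun my/llm-context-from-buffers (&optional max-chars)
    "Collect context from other buffers sharing the current major mode.
  Take the first kilobyte of each buffer and truncate the total to
  MAX-CHARS (default 4000)."
    (let ((limit (or max-chars 4000))
          (mode major-mode)
          (chunks '()))
      (dolist (buf (buffer-list))
        (when (and (not (eq buf (current-buffer)))
                   (eq (buffer-local-value 'major-mode buf) mode))
          (push (with-current-buffer buf
                  (buffer-substring-no-properties
                   (point-min) (min (point-max) (+ (point-min) 1000))))
                chunks)))
      (let ((text (mapconcat #'identity (nreverse chunks) "\n\n")))
        (substring text 0 (min (length text) limit)))))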