
[Bug]: Message: [Trimming prompt to meet context window limitations] making agent change actions. Shouldn't be passed to LLM #6634

Open
amirshawn opened this issue Feb 6, 2025 · 16 comments
Labels
bug (Something isn't working) · severity:medium (Affecting multiple users)

Comments

@amirshawn

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

I am working on a multi-agent system, and this message keeps popping up and is really messing with my flow. Here are a couple of examples:

Trimming prompt to meet context window limitations

I'll help trim the prompts to make them more concise while keeping the essential information. Let me update the agent prompts:

Edited the contents of a file
Edited the contents of a file
Edited the contents of a file
Now let's run the test with the trimmed prompts:

Ran a bash command

Then this:

Trimming prompt to meet context window limitations

Yes, I see the issue. The prompts are too long and causing context window limitations. Let me update the prompts to be more concise:

Edited the contents of a file
undo that, it was a system message, not from me.

I'll undo the change:

Edited the contents of a file
Let me check what's happening with the process. It seems to be stuck at the risk review step. Let me check the risk review prompt:

Read the contents of a file
Read the contents of a file
I see the issue - the risk review prompt is too long. Let me trim it:

Edited the contents of a file
Now let me also trim the strategy update prompt:

Edited the contents of a file

I thought context compression was an option, but I haven't enabled it and don't know how to enable or disable it.

It definitely should not be messaging the LLM in the middle of our conversation like that. If there's a way to make silent messages, that would be OK, but adding them into the conversation is not good.

OpenHands Installation

Docker command in README

OpenHands Version

0.23

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

@enyst
Collaborator

enyst commented Feb 6, 2025

Thank you for the report, @amirshawn!

(cc: @csmith49) This must be the observation we added when the context window errored out. So during the run, the context was exceeded. It isn't context compression exactly; the only other option there would have been for a ContextWindowExceededError to stop the agent (and give it the ERROR state).

I wonder if we should use an empty message string instead?

Out of curiosity @amirshawn what LLM were you using?

@mamoodi added the severity:medium (Affecting multiple users) label on Feb 6, 2025
@amirshawn
Author

I was using Anthropic Claude 3.5 Sonnet, the recommended model. I think an empty message would be totally fine, or even something like "System maintenance in progress, no action needed", or anything that doesn't make it take an action.

@amirshawn
Author

@enyst do you know what is happening when I see this message: "Trimming prompt to meet context window limitations"? Is it actually trimming the prompt? Is there a way to opt out of that? Wouldn't the oldest messages drop out without having to trim the prompt?

@amirshawn
Author

I notice that as the conversation gets longer, this message happens more and more often. I also notice these errors:

09:48:20 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #1 | You can customize retry values in the configuration.
09:48:36 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #2 | You can customize retry values in the configuration.

They seem to coincide with those messages, which, before this change, would often cause the conversation to stall; I would have to respond with "ok" or something similar to keep things moving. I'm just wondering: is it actually changing the prompt when we see that message?

Thank you!

@amirshawn
Author

@enyst @csmith49 do you think you could make it give an empty response in the next version? I am about to start working on more prompts this week, and it was quite the nightmare having it condense my prompts every 10 minutes.

@enyst
Collaborator

enyst commented Feb 10, 2025

@enyst do you know what is happening when I see this message: "Trimming prompt to meet context window limitations"? Is it actually trimming the prompt? Is there a way to opt out of that? Wouldn't the oldest messages drop out without having to trim the prompt?

Oh, sorry @amirshawn, for some reason GitHub didn't show me a notification for your reply (they're always enabled, but occasionally it skips some, no idea why!).

No, please don't worry about that. It doesn't trim the prompt. It only drops the first half of the messages, except the first user message. We assume the first user message is the task, so that one is always kept.
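
In other words, something like this (a hypothetical sketch for illustration only, not the actual OpenHands code):

```python
from typing import Dict, List


def truncate_history(messages: List[Dict]) -> List[Dict]:
    """Drop roughly the older half of the conversation, but always keep
    the first user message (assumed to be the task).

    Hypothetical helper for illustration; the real logic may differ in details.
    """
    if len(messages) <= 2:
        return messages
    first_user_message = messages[0]   # assumed: the task, always kept
    rest = messages[1:]
    midpoint = len(rest) // 2          # drop the older half of the rest
    return [first_user_message] + rest[midpoint:]
```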

@enyst
Collaborator

enyst commented Feb 10, 2025

We'll try your empty-message suggestion. I think it should work too, yes.

@enyst
Collaborator

enyst commented Feb 11, 2025

09:48:20 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #1 | You can customize retry values in the configuration.
09:48:36 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #2 | You can customize retry values in the configuration.

Just to clarify, OpenHands doesn't do anything to the context related to these errors. No change there.

But it's normal that you saw them after the trimming happened, because what must have happened is something like this:

  • Sonnet got to 200k tokens, which is the maximum context window for this model
  • we dropped some of the oldest messages from the context window (except the first user message, which we keep) and just sent the rest
  • there's no guarantee or check on how many tokens that leaves, but let's say, for the sake of the example, that the remaining messages were approximately 100k
  • so four requests to the Anthropic API in one minute would hit their rate limit of 400k tokens per minute.

In this case, OpenHands will wait a bit, then retry the request. If the error happens again, it will wait a bit longer, then retry. Hopefully the minute will have expired by the time it retries, so Anthropic will allow the request.
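
Roughly, that retry behavior is an exponential backoff. As a sketch of the idea only (using the tenacity library, with illustrative wait values and model name, not the actual retry_mixin implementation):

```python
import litellm
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


# Illustrative values only; OpenHands exposes its own retry settings in the configuration.
@retry(
    retry=retry_if_exception_type(litellm.RateLimitError),
    wait=wait_exponential(multiplier=15, max=120),  # back off longer after each failure
    stop=stop_after_attempt(4),
)
def call_llm(messages):
    # Hypothetical wrapper; the model name is just an example.
    return litellm.completion(
        model="anthropic/claude-3-5-sonnet-20241022",
        messages=messages,
    )
```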

We really need a better solution there too. 🤔

@csmith49
Collaborator

I've added a quick PR that removes the messages. The truncation behavior should be unchanged: when we hit a max token limit, we drop the oldest half of the events.

From a UX perspective, do y'all think this truncation behavior is something the user should be made aware of? Right now we'll do the truncation and nobody will be the wiser. I think that's fine for normal usage, but for folks experimenting, unobserved state changes are probably not welcome.

@amirshawn
Author

@csmith49 I think it's definitely something we should be able to opt out of. Has it always worked this way? How does this affect the prompt cache benefits? Also, there must be a way of alerting the user when it occurs without interfering with the conversation, maybe using a toast alert or something.

I am curious: if we are using prompt caching and we are within Anthropic's 5-minute limit, do we send the full conversation each time? I would think we would be sending just the new message and referencing the conversation or previous messages somehow. If that is the case, maybe only do the condensing once we've passed that 5-minute limit, since only at that point does it really have a benefit.

What I personally think would be powerful is this: if you realize the conversation has gone a bit off track, there could be a way to look at the current context and select which parts of the conversation you'd like to keep. That way you could get the conversation back to relevance without having to completely restart it, especially if there was a way to sort the messages by size. Then maybe eventually there could be an agent that is trained to look at the full conversation and remove unnecessary context like logs.

As far as a better way of handling hitting the max, I would think throttling the message sending based on the known context limit of the LLM would make sense.

@amirshawn
Author

Reading about Anthropic's prompt caching, I guess I'm a little off in my understanding of how it works. I'll have to give this some more thought.

I do still think that having a way to manually clean up the context could have a huge benefit. For instance, if you're working through a problem and it takes many changes to finally get everything working, there's no need to hold on to the whole process of getting to that point. There could be a way to, for instance, click "optimize context" and then use gpt-4o-mini or whichever model you choose (I figured something cheap) to look at the whole conversation and identify unnecessary parts. If each message had an ID, you could send a prompt asking the model to identify the unnecessary parts and return their IDs. All those messages would then be selected, and the user would be able to uncheck any parts of the conversation they may want to keep. Does that make sense? I think over time the whole process would be optimized and improved, hopefully to a point where it doesn't even need a person to oversee it, or that part would be optional. I think cutting the context in half is like using an axe for surgery instead of a scalpel.

I apologize for my rambling and for most likely misunderstanding how this all works. I hope my feedback is at least valuable in some way!

@enyst
Collaborator

enyst commented Feb 12, 2025

I think it's definitely something we should be able to opt out of. Has it always worked this way?

Just to clarify, we are talking about multiple methods in this topic, but this one is only one of them, and it covers just the edge case: when the conversation hits the maximum context window. The maximum context window is not per unit of time (it's not a rate limit); it's the upper bound for each model (200k for Sonnet). Once you hit it, you would not be able to do anything else in that conversation: you'd get a fatal error from the Anthropic API, and you'd need to start a new session.

In that case, to avoid the error and continue the conversation, we drop the first half of the messages and send only the other half. It hasn't always worked this way (before, the user got an error and was forced to restart). It has worked this way for quite some time now, starting from here, so about three months.

@csmith49
Collaborator

I think it's definitely something we should be able to opt out of. Has it always worked this way?

The context truncation has worked this way for a long while, but the notification is new.

How does this affect the prompt cache benefits?

When a truncation happens, we temporarily lose out on the cache benefits. But the cache is rebuilt and is still applicable until the next truncation.

Also, there must be a way of alerting the user when it occurs without interfering with the conversation, maybe using a toast alert or something.

Some kind of notification would be good, I agree. The mentioned PR keeps the message in the chat but it is "invisible" to the LLM. Confusing UX, but maybe a fine compromise in the short term.

I am curious: if we are using prompt caching and we are within Anthropic's 5-minute limit, do we send the full conversation each time? I would think we would be sending just the new message and referencing the conversation or previous messages somehow. If that is the case, maybe only do the condensing once we've passed that 5-minute limit, since only at that point does it really have a benefit.

My understanding is that we send the full conversation each time, and the "referencing previous message/conversation" is handled by the API. But I honestly don't know that for certain.

What I personally think would be powerful is this: if you realize the conversation has gone a bit off track, there could be a way to look at the current context and select which parts of the conversation you'd like to keep. That way you could get the conversation back to relevance without having to completely restart it, especially if there was a way to sort the messages by size. Then maybe eventually there could be an agent that is trained to look at the full conversation and remove unnecessary context like logs.

That's an interesting idea. I don't know how much effort that would take to build out the UX but I expect it's non-trivial.

The actual truncation strategy is something we're actively testing: we have a separate LLM with a custom prompt that looks at the events and picks a subset to keep and produces a small summary of the discarded events. (The implementation is in PR #6597).
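
Very roughly, the idea looks something like this (a sketch with made-up names, event format, and prompt, not the actual code in the PR):

```python
import json

import litellm


def condense(events: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    """Ask a cheaper LLM which events to keep and summarize the rest.

    Everything here (event format, prompt, model) is an assumption made
    for illustration; see PR #6597 for the real implementation.
    """
    listing = "\n".join(f"[{i}] {str(e.get('content', ''))[:200]}" for i, e in enumerate(events))
    prompt = (
        "Below is a numbered list of conversation events. Return JSON with "
        "'keep' (indices of events to keep verbatim) and 'summary' "
        "(a short summary of everything else).\n\n" + listing
    )
    response = litellm.completion(model=model, messages=[{"role": "user", "content": prompt}])
    decision = json.loads(response.choices[0].message.content)  # sketch: no error handling

    kept = [events[i] for i in decision["keep"]]
    summary = {"role": "system", "content": "Summary of earlier events: " + decision["summary"]}
    return [summary] + kept
```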

As far as a better way of handling hitting the max, I would think throttling the message sending based on the known context limit of the llm would make sense.

We do use the model's context limit; we just have to wait for the API to tell us we hit it. Are you suggesting we compute the context size before we send any message to the LLM? That's possible in some cases, and resolving issue #6707 might take this approach.

One sticky bit is somehow handling the fact that some models with completion interfaces don't have separate tokenizers we can call. We can support custom tokenizers, but we also need a strategy for dealing with the imprecision of the mismatch (maybe truncating when we hit 90% of the context window size?).
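
For example, a pre-send check could look something like this (a sketch using tiktoken's cl100k_base as a stand-in tokenizer, which is exactly the imprecision mentioned above, hence the safety margin):

```python
import tiktoken


def should_truncate(messages: list[dict], context_window: int, margin: float = 0.9) -> bool:
    """Estimate the prompt size locally before sending it to the API.

    cl100k_base is only an approximation for models that ship their own
    tokenizer, so we truncate once the estimate passes 90% of the window.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    estimated_tokens = sum(len(enc.encode(str(m.get("content", "")))) for m in messages)
    return estimated_tokens > context_window * margin
```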

@csmith49
Collaborator

I apologize for my rambling and for most likely misunderstanding how this all works. I hope my feedback is at least valuable in some way!

The feedback is definitely useful! I think you might enjoy looking at the implementation of the LLMSummaryCondenser from #6597; it's the closest thing we have to your suggestion, and we could definitely use some help improving the prompt and selection strategy.

@amirshawn
Author

I think it's definitely something we should be able to opt out of. Has it always worked this way?

Just to clarify, we are talking about multiple methods in this topic, but this one is only one of them, and it covers just the edge case: when the conversation hits the maximum context window. The maximum context window is not per unit of time (it's not a rate limit); it's the upper bound for each model (200k for Sonnet). Once you hit it, you would not be able to do anything else in that conversation: you'd get a fatal error from the Anthropic API, and you'd need to start a new session.

In that case, to avoid the error and continue the conversation, we drop the first half of the messages and send only the other half. It hasn't always worked this way (before, the user got an error and was forced to restart). It has worked this way for quite some time now, starting from here, so about three months.

I didn't realize that! Well, that clears up a lot in my thinking. So a powerful condenser is extremely important.

@SupaMic

SupaMic commented Feb 27, 2025

I'm not sure how the limit is detected, but I'm running into a context limit on Groq and getting this error:
BadRequestError: litellm.BadRequestError: GroqException - {"error":{"message":"Please reduce the length of the messages or completion.","type":"invalid_request_error","param":"messages","code":"context_length_exceeded"}}
This happens upon switching models within the session, so this is the first message for this particular model, 'DeepSeek R1 Distill Qwen 32B 128k', which has a 200k tokens-per-minute limit on my tier. I have the "memory condensation" setting on.

I love the idea of an automated LLMSummaryCondenser that I could direct. I currently use a continually updated README.md within the codebase so I can scrap a session and use the README as a summary to get the agent back up to speed in a new session.

Until something like that exists reliably, or we have a robust way to handle context-limit errors from all APIs, I think the best way to handle this issue would be to allow raw access to the context for manual editing. Since we can already export the "trajectory" data, we would similarly just need an export of the context being sent (I think just the model "messages" within the trajectory). Then we could edit that large context text manually, and an import option/button could replace the current context. This would also solve the issue of contexts becoming too long for certain smaller models to handle, and we could remove large sections of the prompt that have become irrelevant because of later decisions that changed the strategy.

So we don't need a text editing window, just an import button where we could replace the current chat context. A labeled revision history might be a helpful feature here if we wanted to have specific contexts for different areas of an app.
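
Something like this, purely as a sketch of the workflow I mean (the file names and message format are made up; there's no such import feature today):

```python
import json

# Export the messages, prune them by hand or with a small script like this,
# then re-import the result as the new context.
with open("exported_context.json") as f:   # hypothetical export file
    messages = json.load(f)

# Example pruning rule: keep the last 20 events, and drop older tool output.
pruned = [
    m for i, m in enumerate(messages)
    if i >= len(messages) - 20 or m.get("role") != "tool"
]

with open("edited_context.json", "w") as f:  # hypothetical import file
    json.dump(pruned, f, indent=2)
```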

In any case, it would also be good to have some insight into which files from a workspace the agents are including in the context, as I suspect they are oversharing in some cases.
