
[Bug]: Message: [Trimming prompt to meet context window limitations] making agent change actions. Shouldn't be passed to LLM #6634

Open
amirshawn opened this issue Feb 6, 2025 · 16 comments
Labels
bug (Something isn't working) · severity:medium (Affecting multiple users)

Comments

@amirshawn

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

I am working on a multi-agent system, and this message keeps popping up and is really messing with my flow. Here are a couple of examples:

Trimming prompt to meet context window limitations

I'll help trim the prompts to make them more concise while keeping the essential information. Let me update the agent prompts:

Edited the contents of a file
Edited the contents of a file
Edited the contents of a file
Now let's run the test with the trimmed prompts:

Ran a bash command

Then this:

Trimming prompt to meet context window limitations

Yes, I see the issue. The prompts are too long and causing context window limitations. Let me update the prompts to be more concise:

Edited the contents of a file
undo that, it was a system message, not from me.

I'll undo the change:

Edited the contents of a file
Let me check what's happening with the process. It seems to be stuck at the risk review step. Let me check the risk review prompt:

Read the contents of a file
Read the contents of a file
I see the issue - the risk review prompt is too long. Let me trim it:

Edited the contents of a file
Now let me also trim the strategy update prompt:

Edited the contents of a file

I thought context compression was an option, but I haven't enabled it and don't know how to enable or disable it.

It definitely should not be messaging the LLM in the middle of our conversation like that. If there's a way to make silent messages, that would be OK, but adding them into the conversation is not good.

OpenHands Installation

Docker command in README

OpenHands Version

0.23

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

@enyst
Collaborator

enyst commented Feb 6, 2025

Thank you for the report, @amirshawn!

(cc: @csmith49) This must be the observation we added when the context window errored out. So during the run, the context was exceeded. It isn't context compression exactly; the only other option there would have been for a ContextWindowExceededError to stop the agent (and give it the ERROR state).

I wonder if we should use an empty message string instead?

Out of curiosity @amirshawn what LLM were you using?

@mamoodi added the severity:medium (Affecting multiple users) label on Feb 6, 2025
@amirshawn
Author

I was using Anthropic Claude 3.5 Sonnet, the recommended model. I think an empty message would be totally fine, or even something like "System maintenance in progress, no action needed", or anything that doesn't make it take an action.

@amirshawn
Author

@enyst do you know what is happening when I see this message: "Trimming prompt to meet context window limitations"? Is it actually trimming the prompt? Is there a way to opt out of that? Wouldn't the oldest messages drop out without having to trim the prompt?

@amirshawn
Author

I notice that as the conversation gets longer, this message happens more and more often. I also notice these errors:

09:48:20 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #1 | You can customize retry values in the configuration.
09:48:36 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #2 | You can customize retry values in the configuration.

They seem to coincide with those messages, which, before this change, would often cause the conversation to stall; I would have to respond with "ok" or something similar to keep things moving. I'm just wondering: is it actually changing the prompt when we see that message?

Thank you!

@amirshawn
Author

@enyst @csmith49 do you think you could make it give an empty response in the next version? I am about to start working on more prompts this week, and it was quite the nightmare having it condense my prompts every 10 minutes.

@enyst
Collaborator

enyst commented Feb 10, 2025

@enyst do you know what is happening when I see this message: "Trimming prompt to meet context window limitations"? Is it actually trimming the prompt? Is there a way to opt out of that? Wouldn't the oldest messages drop out without having to trim the prompt?

Oh, sorry @amirshawn, for some reason GitHub didn't show me a notification for your reply (they're always enabled, but occasionally it skips some, no idea why!).

No, please don't worry about that. It doesn't trim the prompt. It only drops the first half of the messages, except the first user message. We assume the first user message is the task, so that one is always kept.
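
In other words, something like this (a hypothetical sketch for illustration only, not the actual OpenHands code):

```python
from typing import Dict, List


def truncate_history(messages: List[Dict]) -> List[Dict]:
    """Drop roughly the older half of the conversation, but always keep
    the first user message (assumed to be the task).

    Hypothetical helper for illustration; the real logic may differ in details.
    """
    if len(messages) <= 2:
        return messages
    first_user_message = messages[0]   # assumed: the task, always kept
    rest = messages[1:]
    midpoint = len(rest) // 2          # drop the older half of the rest
    return [first_user_message] + rest[midpoint:]
```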

@enyst
Collaborator

enyst commented Feb 10, 2025

We'll try your empty-message suggestion. I think it should work too, yes.

@enyst
Collaborator

enyst commented Feb 11, 2025

09:48:20 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #1 | You can customize retry values in the configuration.
09:48:36 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #2 | You can customize retry values in the configuration.

Just to clarify, OpenHands doesn't do anything to the context related to these errors. No change there.

But it's normal that you saw them after the trimming happened, because what must have happened is something like this:

  • Sonnet got to 200k tokens, which is the maximum context window for this model
  • we dropped some of the oldest messages from the context window (except the first user message, which we keep) and just sent the rest
  • there's no guarantee or check on how many tokens that leaves, but let's say, for the sake of the example, that the remaining messages were approximately 100k
  • so four requests to the Anthropic API in one minute would hit their rate limit of 400k tokens per minute.

In this case, OpenHands will wait a bit, then retry the request. If the error happens again, it will wait a bit longer, then retry. Hopefully the minute will have expired by the time it retries, so Anthropic will allow the request.
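
Roughly, that retry behavior is an exponential backoff. As a sketch of the idea only (using the tenacity library, with illustrative wait values and model name, not the actual retry_mixin implementation):

```python
import litellm
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


# Illustrative values only; OpenHands exposes its own retry settings in the configuration.
@retry(
    retry=retry_if_exception_type(litellm.RateLimitError),
    wait=wait_exponential(multiplier=15, max=120),  # back off longer after each failure
    stop=stop_after_attempt(4),
)
def call_llm(messages):
    # Hypothetical wrapper; the model name is just an example.
    return litellm.completion(
        model="anthropic/claude-3-5-sonnet-20241022",
        messages=messages,
    )
```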

We really need a better solution there too. 🤔

@csmith49
Collaborator

I've added a quick PR that removes the messages. The truncation behavior should be unchanged: when we hit a max token limit, we drop the oldest half of the events.

From a UX perspective, do y'all think this truncation behavior is something the user should be made aware of? Right now we'll do the truncation and nobody will be the wiser. I think that's fine for normal usage, but for folks experimenting, unobserved state changes are probably not welcome.

@amirshawn
Author

@csmith49 I think it's definitely something we should be able to opt out of. Has it always worked this way? How does this affect the prompt cache benefits? Also, there must be a way of alerting the user when it occurs without interfering with the conversation, maybe using a toast alert or something.

I am curious: if we are using prompt caching and we are within Anthropic's 5-minute limit, do we send the full conversation each time? I would think we would be sending just the new message and referencing the conversation or previous messages somehow. If that is the case, maybe only do the condensing once we've passed that 5-minute limit, since only at that point does it really have a benefit.

What I personally think would be powerful is this: if you realize the conversation has gone a bit off track, there could be a way to look at the current context and select which parts of the conversation you'd like to keep. That way you could get the conversation back to relevance without having to completely restart it, especially if there was a way to sort the messages by size. Then maybe eventually there could be an agent that is trained to look at the full conversation and remove unnecessary context like logs.

As far as a better way of handling hitting the max, I would think throttling the message sending based on the known context limit of the LLM would make sense.

@amirshawn
Author

Reading about Anthropic's prompt caching, I guess I'm a little off in my understanding of how it works. I'll have to give this some more thought.

I do still think that having a way to manually clean up the context could have a huge benefit. For instance, if you're working through a problem and it takes many changes to finally get everything working, there's no need to hold on to the whole process of getting to that point. There could be a way to, for instance, click "optimize context" and then use gpt-4o-mini or whichever model you choose (I figured something cheap) to look at the whole conversation and identify unnecessary parts. If each message had an ID, you could send a prompt asking the model to identify the unnecessary parts and return their IDs. All those messages would then be selected, and the user would be able to uncheck any parts of the conversation they may want to keep. Does that make sense? I think over time the whole process would be optimized and improved, hopefully to a point where it doesn't even need a person to oversee it, or that part would be optional. I think cutting the context in half is like using an axe for surgery instead of a scalpel.

I apologize for my rambling and for most likely misunderstanding how this all works. I hope my feedback is at least valuable in some way!

@enyst
Collaborator

enyst commented Feb 12, 2025

I think it's definitely something we should be able to opt out of. Has it always worked this way?

Just to clarify, we are talking about multiple methods in this topic, but this one is only one of them, and it covers just the edge case: when the conversation hits the maximum context window. The maximum context window is not per unit of time (it's not a rate limit); it's the upper bound for each model (200k for Sonnet). Once you hit it, you would not be able to do anything else in that conversation: you'd get a fatal error from the Anthropic API, and you'd need to start a new session.

In that case, to avoid the error and continue the conversation, we drop the first half of the messages and send only the other half. It hasn't always worked this way (before, the user got an error and was forced to restart). It has worked this way for quite some time now, starting from here, so about three months.

@csmith49
Collaborator

I think it's definitely something we should be able to opt out of. Has it always worked this way?

The context truncation has worked this way for a long while, but the notification is new.

How does this affect the prompt cache benefits?

When a truncation happens, we temporarily lose out on the cache benefits. But the cache is rebuilt and is still applicable until the next truncation.

Also, there must be a way of alerting the user when it occurs without interfering with the conversation, maybe using a toast alert or something.

Some kind of notification would be good, I agree. The mentioned PR keeps the message in the chat but it is "invisible" to the LLM. Confusing UX, but maybe a fine compromise in the short term.

I am curious: if we are using prompt caching and we are within Anthropic's 5-minute limit, do we send the full conversation each time? I would think we would be sending just the new message and referencing the conversation or previous messages somehow. If that is the case, maybe only do the condensing once we've passed that 5-minute limit, since only at that point does it really have a benefit.

My understanding is that we send the full conversation each time, and the "referencing previous message/conversation" is handled by the API. But I honestly don't know that for certain.

What I personally think would be powerful is this: if you realize the conversation has gone a bit off track, there could be a way to look at the current context and select which parts of the conversation you'd like to keep. That way you could get the conversation back to relevance without having to completely restart it, especially if there was a way to sort the messages by size. Then maybe eventually there could be an agent that is trained to look at the full conversation and remove unnecessary context like logs.

That's an interesting idea. I don't know how much effort that would take to build out the UX but I expect it's non-trivial.

The actual truncation strategy is something we're actively testing: we have a separate LLM with a custom prompt that looks at the events and picks a subset to keep and produces a small summary of the discarded events. (The implementation is in PR #6597).
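
Very roughly, the idea looks something like this (a sketch with made-up names, event format, and prompt, not the actual code in the PR):

```python
import json

import litellm


def condense(events: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    """Ask a cheaper LLM which events to keep and summarize the rest.

    Everything here (event format, prompt, model) is an assumption made
    for illustration; see PR #6597 for the real implementation.
    """
    listing = "\n".join(f"[{i}] {str(e.get('content', ''))[:200]}" for i, e in enumerate(events))
    prompt = (
        "Below is a numbered list of conversation events. Return JSON with "
        "'keep' (indices of events to keep verbatim) and 'summary' "
        "(a short summary of everything else).\n\n" + listing
    )
    response = litellm.completion(model=model, messages=[{"role": "user", "content": prompt}])
    decision = json.loads(response.choices[0].message.content)  # sketch: no error handling

    kept = [events[i] for i in decision["keep"]]
    summary = {"role": "system", "content": "Summary of earlier events: " + decision["summary"]}
    return [summary] + kept
```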

As far as a better way of handling hitting the max, I would think throttling the message sending based on the known context limit of the llm would make sense.

We do use the model's context limit; we just have to wait for the API to tell us we hit it. Are you suggesting we compute the context size before we send any message to the LLM? That's possible in some cases, and resolving issue #6707 might take this approach.

One sticky bit is somehow handling the fact that some models with completion interfaces don't have separate tokenizers we can call. We can support custom tokenizers, but we also need a strategy for dealing with the imprecision of the mismatch (maybe truncating when we hit 90% of the context window size?).
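
For example, a pre-send check could look something like this (a sketch using tiktoken's cl100k_base as a stand-in tokenizer, which is exactly the imprecision mentioned above, hence the safety margin):

```python
import tiktoken


def should_truncate(messages: list[dict], context_window: int, margin: float = 0.9) -> bool:
    """Estimate the prompt size locally before sending it to the API.

    cl100k_base is only an approximation for models that ship their own
    tokenizer, so we truncate once the estimate passes 90% of the window.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    estimated_tokens = sum(len(enc.encode(str(m.get("content", "")))) for m in messages)
    return estimated_tokens > context_window * margin
```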

@csmith49
Collaborator

I apologize for my rambling and for most likely misunderstanding how this all works. I hope my feedback is at least valuable in some way!

The feedback is definitely useful! I think you might enjoy looking at the implementation of the LLMSummaryCondenser from #6597; it's the closest thing we have to your suggestion, and we could definitely use some help improving the prompt and selection strategy.

@amirshawn
Author

I think it's definitely something we should be able to opt out of. Has it always worked this way?

Just to clarify, we are talking about multiple methods in this topic, but this one is only one of them, and it covers just the edge case: when the conversation hits the maximum context window. The maximum context window is not per unit of time (it's not a rate limit); it's the upper bound for each model (200k for Sonnet). Once you hit it, you would not be able to do anything else in that conversation: you'd get a fatal error from the Anthropic API, and you'd need to start a new session.

In that case, to avoid the error and continue the conversation, we drop the first half of the messages and send only the other half. It hasn't always worked this way (before, the user got an error and was forced to restart). It has worked this way for quite some time now, starting from here, so about three months.

I didn't realize that! Well, that clears up a lot in my thinking. So a powerful condenser is extremely important.

@SupaMic

SupaMic commented Feb 27, 2025

I'm not sure how the limit is detected, but I'm running into a context limit on Groq and getting this error:
BadRequestError: litellm.BadRequestError: GroqException - {"error":{"message":"Please reduce the length of the messages or completion.","type":"invalid_request_error","param":"messages","code":"context_length_exceeded"}}
This happens upon switching models within the session, so this is the first message for this particular model, 'DeepSeek R1 Distill Qwen 32B 128k', which has a 200k tokens-per-minute limit on my tier. I have the "memory condensation" setting on.

I love the idea of an automated LLMSummaryCondenser that I could direct. I currently use a continually updated README.md within the codebase so I can scrap a session and use the README as a summary to get the agent back up to speed in a new session.

Until something like that exists reliably, or we have a robust way to handle context-limit errors from all APIs, I think the best way to handle this issue would be to allow raw access to the context for manual editing. Since we can already export the "trajectory" data, we would similarly just need an export of the context being sent (I think just the model "messages" within the trajectory). Then we could edit that large context text manually, and an import option/button could replace the current context. This would also solve the issue of contexts becoming too long for certain smaller models to handle, and we could remove large sections of the prompt that have become irrelevant because of later decisions that changed the strategy.

So we don't need a text editing window, just an import button where we could replace the current chat context. A labeled revision history might be a helpful feature here if we wanted to have specific contexts for different areas of an app.
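
Something like this, purely as a sketch of the workflow I mean (the file names and message format are made up; there's no such import feature today):

```python
import json

# Export the messages, prune them by hand or with a small script like this,
# then re-import the result as the new context.
with open("exported_context.json") as f:   # hypothetical export file
    messages = json.load(f)

# Example pruning rule: keep the last 20 events, and drop older tool output.
pruned = [
    m for i, m in enumerate(messages)
    if i >= len(messages) - 20 or m.get("role") != "tool"
]

with open("edited_context.json", "w") as f:  # hypothetical import file
    json.dump(pruned, f, indent=2)
```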

In any case, it would also be good to have some insight into which files from a workspace the agents are including in the context, as I suspect they are oversharing in some cases.
