[Bug]: Message: [Trimming prompt to meet context window limitations] making agent change actions. Shouldn't be passed to LLM #6634
Comments
Thank you for the report @amirshawn! (cc: @csmith49) This must be the observation we added when the context window errored out, so during the run the context window was exceeded. It isn't context compression exactly; the only other option there would have been for a ContextWindowExceededError to stop the agent (and give an ERROR state). I wonder if we should use an empty message string instead? Out of curiosity @amirshawn, what LLM were you using?
I was using Anthropic Claude Sonnet 3.5, the recommended model. I think an empty message would be totally fine, or even something like "System maintenance in progress, no action needed." Or anything that doesn't make it take an action.
@enyst, do you know what is happening when I see this message: Trimming prompt to meet context window limitations?
I notice that as the conversation gets longer, this message happens more and more often. I also see errors like: 09:48:20 - openhands:ERROR: retry_mixin.py:55 - litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 400,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}. Attempt #1 | You can customize retry values in the configuration. They seem to coincide with those messages, which before this change would often cause the conversation to stall; I would have to respond with "ok" or something similar to keep things moving. I'm just wondering if it's actually changing the prompt when we see that message? Thank you!
Oh, sorry @amirshawn, for some reason GitHub didn't show me a notification for your reply (they're always enabled, but occasionally it skips some, no idea why!). No, please don't worry about that. It doesn't trim the prompt. It only drops the first half of the messages, except the first user message. We assume the first user message is the task, so that one is always kept.
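For illustration, here is a minimal sketch of that truncation rule (drop the oldest half of the history but always keep the first user message); it is only an approximation, not the actual OpenHands code:

```python
# Illustrative sketch only, not the OpenHands implementation: drop the
# oldest half of the history while always keeping the first user message,
# which is assumed to be the task.
def truncate_history(messages: list[dict]) -> list[dict]:
    if len(messages) <= 2:
        return messages
    first_user = next((m for m in messages if m.get("role") == "user"), None)
    kept = messages[len(messages) // 2:]  # keep the most recent half
    if first_user is not None and first_user not in kept:
        kept = [first_user] + kept
    return kept
```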
We'll try your empty-message suggestion. I think it should work, too.
Just to clarify: it's normal that you saw those errors after the trimming happened, because what must have happened is something like this:
In this case, we really need a better solution there too. 🤔
I've added a quick PR that removes the messages. The truncation behavior should be unchanged: when we hit a max token limit, we drop the oldest half of the events. From a UX perspective, do y'all think this truncation behavior is something the user should be made aware of? Right now we'll do the truncation and nobody will be the wiser. I think that's fine for normal usage, but for folks experimenting, unobserved state changes are probably not welcome.
@csmith49 I think it's definitely something we should be able to opt out of. Has it always worked this way? How does this affect the prompt cache benefits? Also, there must be a way of alerting the user when it occurs without interfering with the conversation, maybe using a toast alert or something.

I am curious: if we are using prompt caching and we are within Anthropic's 5-minute limit, do we send the full conversation each time? I would think we would send just the new message and reference the conversation or previous messages somehow. If that is the case, maybe only do the condensing once we've passed that 5-minute limit, since only at that point does it really have a benefit.

I personally think what would be powerful is, if you realize the conversation has gone a bit off track, a way to look at the current context and select which parts of the conversation you'd like to keep. That way you could get the conversation back to relevance without having to completely restart it, especially if there was a way to sort the messages by size. Then maybe eventually have an agent that is trained to look at the full conversation and remove unnecessary context like logs.

As for a better way of handling hitting the max, I would think throttling the message sending based on the known context limit of the LLM would make sense.
Reading about Anthropic's prompt caching, I guess I'm a little off on my understanding of how it works. I'll have to give this some more thought.

I do still think that having a way to manually clean up the context could have a huge benefit. For instance, if you're working through a problem and it takes many changes to finally get everything working, there's no need to hold on to the whole process of getting to that point. There could be a way to, for instance, click "optimize context" and then use gpt-4o-mini or whichever model you choose (I figured something cheap) to look at the whole conversation and identify unnecessary parts. If each message had an id, you could send a prompt asking the model to identify the unnecessary parts and return their ids. All those messages would then be selected, and the user would be able to uncheck any parts of the conversation they want to keep. Does that make sense?

I think over time the whole process would be optimized and improved, hopefully to a point where it doesn't even need a person to oversee it, or that part would be optional. I think cutting the context in half is like using an axe for surgery instead of a scalpel. I apologize for my rambling and most likely misunderstanding how this all works. I hope my feedback is at least valuable in some way!
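As a rough sketch of that "optimize context" idea (purely hypothetical: the helper name, the prompt wording, and the assumption that message contents are plain strings with positional ids are all invented for illustration; litellm is only used as a generic completion client):

```python
import json

import litellm

# Hypothetical sketch of the "optimize context" idea: ask a cheap model to
# point out messages that can be dropped, then let the user confirm.
# Nothing here is an existing OpenHands feature.
def suggest_removable_ids(messages: list[dict], model: str = "gpt-4o-mini") -> list[int]:
    # Assumes each message's content is a plain string; real messages can
    # be structured content blocks.
    numbered = "\n".join(
        f"[{i}] {m['role']}: {str(m['content'])[:200]}" for i, m in enumerate(messages)
    )
    prompt = (
        "Below is a numbered conversation. Return a JSON array containing the "
        "indices of messages that are no longer needed (e.g. stale logs):\n\n"
        + numbered
    )
    response = litellm.completion(model=model, messages=[{"role": "user", "content": prompt}])
    return json.loads(response.choices[0].message.content)
```

A UI could pre-select the returned ids and let the user uncheck anything they still want to keep before those messages are dropped.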
Just to clarify, we are talking about multiple mechanisms in this topic; this one is only one of them, and it's just the edge case: when the conversation hits the maximum context window. The maximum context window is not per unit of time (it's not the rate limits); it's the upper bound for each model (200k for Sonnet). Once you hit it, you would not be able to do anything in that conversation: you'd get a fatal error from the Anthropic API, and that's it, you'd need to start a new session. In that case, to avoid the error and continue the conversation, we drop the first half of the messages and send only the other half. It hasn't always worked this way (before, the user got an error and was forced to restart). It has worked this way for quite some time now, starting from here, so about three months.
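To make that flow concrete, here is a rough sketch (assuming litellm raises ContextWindowExceededError for this case, and reusing the hypothetical truncate_history helper from the earlier sketch; this is not the actual agent loop):

```python
import litellm

# Rough sketch of the flow described above: a context-window error is fatal
# for that request, so instead of ending the session the history is
# truncated and the request retried once.
def complete_with_truncation(model: str, messages: list[dict]):
    try:
        return litellm.completion(model=model, messages=messages)
    except litellm.ContextWindowExceededError:
        messages = truncate_history(messages)  # drop oldest half, keep the task message
        return litellm.completion(model=model, messages=messages)
```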
The context truncation has worked this way for a long while, but the notification is new.
When a truncation happens, we temporarily lose out on the cache benefits. But the cache is rebuilt and is still applicable until the next truncation.
Some kind of notification would be good, I agree. The mentioned PR keeps the message in the chat but it is "invisible" to the LLM. Confusing UX, but maybe a fine compromise in the short term.
My understanding is that we send the full conversation each time, and the "referencing previous message/conversation" is handled by the API. But I honestly don't know that for certain.
That's an interesting idea. I don't know how much effort that would take to build out the UX but I expect it's non-trivial. The actual truncation strategy is something we're actively testing: we have a separate LLM with a custom prompt that looks at the events and picks a subset to keep and produces a small summary of the discarded events. (The implementation is in PR #6597).
We do use the model's context limit, we just have to wait for the API to tell us we hit it. Are you suggesting we compute the context size before we send any message to the LLM? That's possible in some cases, and resolving issue #6707 might take this approach. One sticky bit is handling the fact that some models with completion interfaces don't have separate tokenizers we can call. We can support custom tokenizers, but we also need a strategy for dealing with the imprecision of the mismatch (maybe truncating when we hit 90% of the context window size?).
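For instance, a pre-send check along those lines might look like the sketch below (assuming litellm's token_counter gives a usable, if approximate, count for the model and that the context window size is known; the 90% margin is there to absorb tokenizer mismatch):

```python
import litellm

# Sketch of a proactive check before sending: estimate the prompt size and
# truncate early when it crosses a safety margin of the context window.
# Token counts may be approximate for providers without a public tokenizer.
def needs_truncation(model: str, messages: list[dict], context_window: int, margin: float = 0.9) -> bool:
    estimated = litellm.token_counter(model=model, messages=messages)
    return estimated > int(context_window * margin)
```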
The feedback is definitely useful! I think you might enjoy looking at the implementation of the condenser.
I didn't realize that! Well that clears up a lot in my thinking. So a powerful condenser is extremely important. |
I'm not sure how the limit is detected, but I'm running into a context limit on Groq and getting errors. I love the idea of an automated LLMSummaryCondenser which I could direct. I currently keep a continually updated README.md within the codebase so I can scrap a session and use it as a summary to get the agent back up to speed in a new session.

Until something like that exists reliably, or we have a robust way to handle context limit errors from all APIs, I think the best way to handle this issue would be to allow raw access to the context for manual editing. Since we can already export the "trajectory" data, we would similarly just need an export of the context being sent (I think just the model "messages" within the trajectory); then we could edit that large context text manually, and an import option/button could replace the current context. This would also solve the issue of contexts becoming too long for certain smaller models to handle, and we could remove large sections of the prompt that have become irrelevant because of later decisions that changed the strategy. So we don't need a text editing window, just an import button where we could replace the current chat context. Labeled revision history may be a helpful feature here if we wanted to have specific contexts for different areas of an app.

In any case, it would also be good to have some insight into which files from a workspace are being included by the agents in the context, as I suspect they are oversharing in some cases.
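As a purely hypothetical illustration of that manual workflow (no such export/import of the live context exists today; the file name and JSON shape below are assumptions):

```python
import json

# Export the conversation messages, hand-prune them, and load the edited
# file back in. The structure here is assumed for illustration only.
with open("trajectory.json") as f:
    messages = json.load(f)

# e.g. drop verbose tool output that is no longer relevant
pruned = [m for m in messages if "old build log" not in str(m.get("content", ""))]

with open("trajectory_pruned.json", "w") as f:
    json.dump(pruned, f, indent=2)
```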
Is there an existing issue for the same bug?
Describe the bug and reproduction steps
I am working on a multi-agent system, and this message that keeps popping up is really messing with my flow. Here are a couple of examples:
Then this:
I thought the context compression was an option, but I haven't enabled it and don't know how to enable or disable it.
It definitely should not be messaging the LLM in the middle of our conversation like that. If there's a way to make silent messages, that would be OK, but adding it into the conversation is not good.
OpenHands Installation
Docker command in README
OpenHands Version
0.23
Operating System
MacOS
Logs, Errors, Screenshots, and Additional Context
No response