Page Refresh now restarts agent loop if status is STOPPED or ERROR #6829
base: main
Can't we try, on refresh, to reconnect the runtime?
That is what this PR does. Before this PR, the problem was that the agent loop would still be running and therefore would not restart. Now, a disconnected runtime stops the agent loop, so a page refresh restarts the agent loop and thereby triggers a reconnect or restart of the runtime.
You're right, we need a solution for that behavior. Thank you for this.
I'm not sure this is quite the way to do it, but I may be wrong. The message callback channel was intended for displaying error strings in the UI, so using it to close the entire session is a bit surprising. What if, at refresh time, we reconnect the runtime and close the old agent loop if it was disconnected? Does that make sense?
On a side note, I wonder what will happen when the user still has useful things to do without a runtime. Right now they can chat with the LLM and try to create a delegate (these are not runnable actions, so they don't require a runtime). What if we add a summarization tool the user could use (it doesn't need a runtime either), or integrate MCP? Just wondering, maybe I'm missing something: would these be possible with this PR?
I like this approach better. I'll update the PR.
openhands/server/conversation_manager/standalone_conversation_manager.py
…:All-Hands-AI/OpenHands into fix-disconnected-runtime-stop-agent-loop
This reverts commit bf82f75.
…me-stop-agent-loop
if isinstance(event, AgentStateChangedObservation):
    if event.agent_state in (
        AgentState.STOPPED.value,
        AgentState.ERROR.value,
    ):
        ...  # rest of the diff hunk not shown
I think the agent can end up STOPPED after FINISHED for "good" reasons, not only after errors, though I could be wrong. Either way, if it stops and restarts in some innocent cases, I'm not sure that's necessarily a bad thing.
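To make the reviewer's point concrete, here is a minimal sketch of the state check from the diff. `AgentState` and `AgentStateChangedObservation` below are simplified stand-ins written for illustration (the real OpenHands classes have more members and fields, and the lowercase string values are assumed), and `needs_restart` is a hypothetical helper name. A FINISHED state on its own does not trigger a restart, but a STOPPED that follows it does, exactly like ERROR.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    # Simplified stand-in for OpenHands' AgentState; members and values are assumed
    RUNNING = "running"
    FINISHED = "finished"
    STOPPED = "stopped"
    ERROR = "error"


@dataclass
class AgentStateChangedObservation:
    # Stand-in: the real observation carries more fields than this
    agent_state: str


# States after which a page refresh should start a fresh agent loop
RESTARTABLE_STATES = (AgentState.STOPPED.value, AgentState.ERROR.value)


def needs_restart(event: object) -> bool:
    """Hypothetical helper: True if the event reports a state that warrants a restart."""
    if isinstance(event, AgentStateChangedObservation):
        return event.agent_state in RESTARTABLE_STATES
    # Any other event type never triggers a restart
    return False


# A benign FINISHED -> STOPPED sequence ends in a restartable state, just like ERROR
for state in (AgentState.FINISHED, AgentState.STOPPED, AgentState.ERROR):
    print(state.name, needs_restart(AgentStateChangedObservation(state.value)))
# FINISHED False
# STOPPED True
# ERROR True
```

Because the check only sees the latest state, it cannot distinguish a benign stop from a failure; the reviewer suggests that restarting in those innocent cases may be acceptable.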
The main fix of this PR looks good to me. It would be great if Robert takes a look too.
I think it's better not to include the litellm client fix here; we could keep that discussion in the other PR.
Agreed. I had only added that here because I was testing both at the same time in the SaaS staging environment.
End-user friendly description of the problem this fixes or functionality that this introduces
The AgentLoop is now restarted on reconnect (for example, a page refresh) if its status is STOPPED or ERROR.
Give a summary of what the PR does, explaining any non-trivial design decisions
If the runtime stops (possibly due to an external error, running out of memory, or an issue with Kubernetes / Docker), the server would continue without a runtime, spewing errors without actually handling the failure. After this change, the AgentLoop is restarted on reconnect if its status is STOPPED or ERROR.
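A minimal sketch of the reconnect flow described above, following the reviewer's suggestion of handling it at refresh time. `ConversationManager`, `AgentLoop`, `join_conversation`, and the lowercase state strings are hypothetical names chosen for illustration; they do not mirror the actual classes in `standalone_conversation_manager.py`, and the real server does more work (for example, reconnecting or restarting the runtime) when it starts a new loop.

```python
from dataclasses import dataclass, field

# Assumed lowercase state values; a loop in one of these states is replaced on reconnect
RESTARTABLE_STATES = {"stopped", "error"}


@dataclass
class AgentLoop:
    """Hypothetical stand-in for a per-conversation agent loop and its runtime."""
    sid: str
    state: str = "running"
    closed: bool = False

    def close(self) -> None:
        # In a real server this would tear down the session and release the runtime
        self.closed = True


@dataclass
class ConversationManager:
    """Hypothetical manager that keeps one agent loop per conversation id."""
    loops: dict[str, AgentLoop] = field(default_factory=dict)

    def join_conversation(self, sid: str) -> AgentLoop:
        """Called on page refresh / reconnect: reuse a healthy loop, replace a dead one."""
        loop = self.loops.get(sid)
        if loop is not None and loop.state in RESTARTABLE_STATES:
            # The old loop stopped or errored (e.g. its runtime container vanished),
            # so close it and fall through to starting a fresh loop
            loop.close()
            loop = None
        if loop is None:
            loop = AgentLoop(sid)
            self.loops[sid] = loop
        return loop


manager = ConversationManager()
first = manager.join_conversation("abc")
first.state = "error"  # simulate a loop that errored after losing its runtime
second = manager.join_conversation("abc")  # e.g. the user refreshes the page
print(first.closed, second is first, second.state)  # True False running
```

A loop that is still running is returned unchanged, so refreshing a healthy conversation does not restart anything.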
Example
data:image/s3,"s3://crabby-images/9e3c4/9e3c487cabfe2417b90a29c3c5e658f8596f6898" alt="image"
This silly conversation...
If the Docker container is deleted...
data:image/s3,"s3://crabby-images/375d9/375d92b770dc4807ab0bf99027ba502b92e70d94" alt="image"
On `main`, subsequent prompts fail, telling users to refresh the page...
data:image/s3,"s3://crabby-images/4ae3a/4ae3aedf1db8f0804963e16649c64170650288c1" alt="image"
But refreshing the page does not clear the issue: the agent remains in the ERROR state.
After this change, a page refresh restarts the agent loop, because the loop has been stopped. The agent is still aware that something went wrong, as the output from a "continue" prompt shows:
data:image/s3,"s3://crabby-images/f39ee/f39ee5c2fbc4b5a6d1a4e43783dc8a3e3f2e8a57" alt="image"
Link of any specific issues this addresses
To run this PR locally, use the following command: