Page Refresh now restarts agent loop if status is STOPPED or ERROR #6829

Open
wants to merge 21 commits into
base: main
from

Conversation

tofarr
Collaborator

@tofarr tofarr commented Feb 19, 2025

End-user friendly description of the problem this fixes or functionality that this introduces
The AgentLoop is now restarted on reconnect if its status is STOPPED or ERROR.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

If the runtime stops (possibly because of an external error, an out-of-memory condition, or an issue with Kubernetes or Docker), the server would continue without a runtime, logging errors without actually handling the failure. After this change, the AgentLoop is restarted on reconnect.
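The restart-on-reconnect behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `AgentLoop` class, the `on_reconnect` function, the `loops` registry, and the enum values are all hypothetical stand-ins, and the only detail taken from the PR is that a loop whose status is STOPPED or ERROR gets restarted when the client reconnects.

```python
import asyncio
from enum import Enum


class AgentState(str, Enum):
    # Hypothetical stand-in enum; only STOPPED and ERROR are named in this PR.
    RUNNING = "running"
    STOPPED = "stopped"
    ERROR = "error"


class AgentLoop:
    """Hypothetical agent loop that only tracks its state and start count."""

    def __init__(self) -> None:
        self.state = AgentState.STOPPED
        self.starts = 0

    async def start(self) -> None:
        self.starts += 1
        self.state = AgentState.RUNNING

    async def close(self) -> None:
        self.state = AgentState.STOPPED


# States in which a reconnect should replace the loop (assumption based on the PR title).
RESTARTABLE = (AgentState.STOPPED, AgentState.ERROR)


async def on_reconnect(loops: dict[str, AgentLoop], sid: str) -> AgentLoop:
    """Return a running loop for `sid`, replacing a stopped or errored one."""
    loop = loops.get(sid)
    if loop is not None and loop.state not in RESTARTABLE:
        # A healthy loop is reused as-is.
        return loop
    if loop is not None:
        # Close the old loop before replacing it.
        await loop.close()
    loop = AgentLoop()
    await loop.start()
    loops[sid] = loop
    return loop
```

In this sketch a healthy loop is left untouched on refresh, so only sessions that actually stopped or errored pay the cost of a restart.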

Example
This silly conversation...
image

If the docker container is deleted...
image

On main, subsequent prompts fail, telling users to refresh the page...
image

But refreshing the page does not clear the issue - the Agent remains in the Error state.

After the change, a page refresh will restart the runloop. (Because it has been stopped!) The agent is still aware that something went wrong, as evidenced by the output from a continue prompt:
image


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:f2d2048-nikolaik   --name openhands-app-f2d2048   docker.all-hands.dev/all-hands-ai/openhands:f2d2048

@enyst
Collaborator

enyst commented Feb 19, 2025

Can't we try, on refresh, to reconnect the runtime?

@tofarr
Collaborator Author

tofarr commented Feb 19, 2025

Can't we try, on refresh, to reconnect the runtime?

That is what this PR does - before this PR, the problem was that the agent loop would be running, and therefore would not restart. Now, a disconnected runtime triggers the agent loop to stop, so that a page refresh will restart the agent loop and thereby trigger a reconnect / restart of the runtime.

@tofarr tofarr marked this pull request as ready for review February 19, 2025 21:19
Collaborator

@enyst enyst left a comment

You're right, we need a solution for that behavior. Thank you for this.

I'm not sure this is quite the way to do it, but I may be wrong. The message callback channel was intended for displaying error strings in the UI, so using it to close the entire session is a bit surprising. What if we reconnect the runtime at refresh, and close the old agent loop if it was disconnected, at that time? Does that make sense?

@enyst
Collaborator

enyst commented Feb 19, 2025

On a side note, I wonder what happens when the user has useful things to do even without a runtime available. Right now they can chat with the LLM, and they can try to create a delegate (these are not runnable actions, so they don't require a runtime). What if we add a summarization tool the user could use (that doesn't need a runtime either), or integrate MCP?

Just wondering, and maybe I'm missing something: would these still be possible with this PR?

@tofarr
Collaborator Author

tofarr commented Feb 19, 2025

What if we reconnect the runtime at refresh, and close the old agent loop if it was disconnected, at that time, at refresh

I like this approach better. I'll update the PR.

@tofarr tofarr marked this pull request as draft February 19, 2025 21:46
@tofarr tofarr changed the title from "Disconnected runtime now stops the agent loop" to "Page Refresh now restarts agent loop if status is STOPPED or ERROR" Feb 23, 2025
@tofarr tofarr requested review from rbren and enyst February 23, 2025 12:06
@tofarr tofarr marked this pull request as ready for review February 23, 2025 12:08
if isinstance(event, AgentStateChangedObservation):
    if event.agent_state in (
        AgentState.STOPPED.value,
        AgentState.ERROR.value,
Collaborator

I think it can be STOPPED after FINISHED for "good" reasons, not only for errors, though I could be wrong. Anyway if it stops and restarts in some innocent cases, I'm not sure that's necessarily a bad thing.
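To make the point about "innocent" restarts concrete, here is a small sketch of a state check like the one under review. It is an assumption-laden illustration: the `AgentState` enum below is a stand-in with made-up string values, and `needs_restart` is a hypothetical helper, not a function from the codebase. It shows that a check on STOPPED and ERROR alone cannot tell a STOPPED that follows FINISHED from one caused by a failure.

```python
from enum import Enum


class AgentState(str, Enum):
    # Stand-in for the project's enum; the string values are assumptions.
    RUNNING = "running"
    FINISHED = "finished"
    STOPPED = "stopped"
    ERROR = "error"


def needs_restart(agent_state: str) -> bool:
    """Hypothetical helper mirroring the membership test in the diff."""
    return agent_state in (
        AgentState.STOPPED.value,
        AgentState.ERROR.value,
    )


# A normal run that finishes and then stops still ends in a restartable state.
history = ["running", "finished", "stopped"]
flags = [needs_restart(state) for state in history]
print(flags)  # [False, False, True]
```

If restarting after a clean finish ever turned out to be undesirable, the check would need to look at the previous state as well, which this sketch deliberately does not do.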

Collaborator

@enyst enyst left a comment

The main fix of this PR looks good to me. It would be great if Robert takes a look too.

I think it's better to not include here the litellm client fix, we could perhaps keep that discussion in the other PR.

@tofarr
Collaborator Author

tofarr commented Feb 24, 2025

I think it's better to not include here the litellm client fix, we could perhaps keep that discussion in the other PR.

Agreed. I had only added that here because I was testing both at the same time in the SaaS staging environment.

3 participants