Page Refresh now restarts agent loop if status is STOPPED or ERROR #6829

tofarr · 2025-02-19T13:46:35Z

End-user friendly description of the problem this fixes or functionality that this introduces
The AgentLoop is now restarted on reconnect if the status is STOPPED or ERROR

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

If the runtime stops (possibly due to an external error, running out of memory, or an issue with Kubernetes / Docker), the server would continue without a runtime, spewing errors without actually handling the failure. After this change, the AgentLoop is restarted on reconnect.

Example
This silly conversation...

If the docker container is deleted...

On main, subsequent prompts fail and tell users to refresh the page...

But refreshing the page does not clear the issue - the Agent remains in the Error state.

After the change, a page refresh will restart the runloop. (Because it has been stopped!) The agent is still aware that something went wrong, as evidenced by the output from a continue prompt:

Link of any specific issues this addresses

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:f2d2048-nikolaik \
  --name openhands-app-f2d2048 \
  docker.all-hands.dev/all-hands-ai/openhands:f2d2048

openhands/server/session/agent_session.py

enyst · 2025-02-19T15:07:05Z

Can't we try, on refresh, to reconnect the runtime?

tofarr · 2025-02-19T15:29:14Z

Can't we try, on refresh, to reconnect the runtime?

That is what this PR does. Before this PR, the problem was that the agent loop would still be running, and therefore would not be restarted. Now a disconnected runtime causes the agent loop to stop, so a page refresh restarts the agent loop and thereby triggers a reconnect / restart of the runtime.

enyst

You're right we need a solution for that behavior, thank you for this.

I'm not sure this is quite the way to do it, but I may be wrong. The message callback channel was intended for displaying error strings in the UI, so using it to close the entire session is a bit surprising. What if we reconnect the runtime at refresh, and close the old agent loop if it was disconnected at that time? Does that make sense?

enyst · 2025-02-19T21:36:23Z

On a side note, I wonder what will happen once the user has more useful things to do without a runtime available. Right now they can chat with the LLM and try to create a delegate (these are not runnable actions, so they don't require a runtime). What if we add a summarization tool the user could use (that doesn't need a runtime either), or integrate MCP?

Just wondering, and maybe I'm missing something: would those still be possible with this PR?

tofarr · 2025-02-19T21:45:59Z

What if we reconnect the runtime at refresh, and close the old agent loop if it was disconnected, at that time, at refresh

I like this approach better. I'll update the PR.
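For readers following the thread, the refresh-time approach agreed on here could be sketched roughly as below. This is only an illustration under assumptions: `Session`, `ConversationManager`, `join_conversation`, and the plain-string states are hypothetical stand-ins, not the actual OpenHands classes or method signatures.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical state strings; the real project uses its own AgentState enum.
RESTARTABLE_STATES = {"stopped", "error"}


@dataclass
class Session:
    """Stand-in for a running agent loop tied to one conversation."""
    sid: str
    state: str = "running"
    closed: bool = False

    def close(self) -> None:
        # In a real server this would tear down the loop and its runtime.
        self.closed = True


@dataclass
class ConversationManager:
    """Stand-in manager that keeps one session per conversation id."""
    sessions: Dict[str, Session] = field(default_factory=dict)

    def join_conversation(self, sid: str) -> Session:
        """Called on page refresh / reconnect (assumed entry point)."""
        session: Optional[Session] = self.sessions.get(sid)
        # Only at refresh time: discard a loop that has stopped or errored.
        if session is not None and session.state in RESTARTABLE_STATES:
            session.close()
            session = None
        # Start a fresh loop if none is usable; a healthy loop is reused.
        if session is None:
            session = Session(sid)
            self.sessions[sid] = session
        return session
```

The point of the design is that nothing is torn down when the error occurs; the decision is deferred until the user reconnects, so a healthy loop is never interrupted.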

openhands/server/conversation_manager/standalone_conversation_manager.py


enyst · 2025-02-24T11:45:38Z

openhands/server/conversation_manager/standalone_conversation_manager.py

+            if isinstance(event, AgentStateChangedObservation):
+                if event.agent_state in (
+                    AgentState.STOPPED.value,
+                    AgentState.ERROR.value,


I think it can be STOPPED after FINISHED for "good" reasons, not only because of errors, though I could be wrong. Anyway, if it stops and restarts in some innocent cases, I'm not sure that's necessarily a bad thing.
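To make the point about STOPPED following FINISHED concrete, here is a minimal sketch of the kind of check the hunk above performs. It assumes the decision is based on the most recent state-change event in the stream; the `AgentState` and `AgentStateChangedObservation` definitions below are simplified local stand-ins, and `last_agent_state` / `should_restart_on_refresh` are hypothetical helper names, not functions from the PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class AgentState(str, Enum):
    # Local stand-in with only the members needed for this sketch.
    RUNNING = "running"
    FINISHED = "finished"
    STOPPED = "stopped"
    ERROR = "error"


@dataclass
class AgentStateChangedObservation:
    # Stand-in event that carries the state as a plain string value.
    agent_state: str


RESTARTABLE_STATES = (AgentState.STOPPED.value, AgentState.ERROR.value)


def last_agent_state(events: Iterable[object]) -> Optional[str]:
    """Return the state from the latest state-change event, if any."""
    state = None
    for event in events:
        if isinstance(event, AgentStateChangedObservation):
            state = event.agent_state
    return state


def should_restart_on_refresh(events: Iterable[object]) -> bool:
    """True when the latest recorded state is STOPPED or ERROR."""
    return last_agent_state(events) in RESTARTABLE_STATES
```

With this check, a stream that ends in FINISHED followed by STOPPED also triggers a restart, which is the "innocent" case mentioned above.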

enyst

The main fix of this PR looks good to me. It would be great if Robert takes a look too.

I think it's better not to include the litellm client fix here; we could perhaps keep that discussion in the other PR.

tofarr · 2025-02-24T13:51:25Z

I think it's better to not include here the litellm client fix, we could perhaps keep that discussion in the other PR.

Agreed - I had only added that in here because I was testing both at the same time in the SAAS staging environment.

tofarr added 2 commits February 19, 2025 13:46

Disconnected runtime now stops the agent loop

2515c2e

Updated comment

2a31e4d

tofarr commented Feb 19, 2025

openhands/server/session/agent_session.py

Merge branch 'main' into fix-disconnected-runtime-stop-agent-loop

2477d16

tofarr marked this pull request as ready for review February 19, 2025 21:19

enyst requested changes Feb 19, 2025

tofarr marked this pull request as draft February 19, 2025 21:46

rbren reviewed Feb 19, 2025

openhands/server/conversation_manager/standalone_conversation_manager.py

tofarr added 17 commits February 20, 2025 09:06

More logging to catch errors

288f625

Merge branch 'fix-disconnected-runtime-stop-agent-loop' of github.com…

7de9324

…:All-Hands-AI/OpenHands into fix-disconnected-runtime-stop-agent-loop

More debugging

35e94dc

Restart on refresh if event stream in error or stopped state

954ccb1

Add prompt caching

9e17c76

Trace LLM Params

91fe248

Checking to see if the api key is getting lost

0448979

I need that API key to reproduce this issue locally

dd5c00c

WIP

cfbd262

More logging

c4364d0

Merge branch 'main' into fix-disconnected-runtime-stop-agent-loop

1943dc4

Removed approach that is no longer used

e65a8db

Revert "Revert "Fix: File Descriptor leak" (#6887)"

4f36ae0

This reverts commit bf82f75.

Fix for log completions

4bd3290

Added test

8b41359

Merge branch 'main' into fix-disconnected-runtime-stop-agent-loop

028d33a

Merge branch 'fix-file-descriptor-leak-2' into fix-disconnected-runti…

db47541

…me-stop-agent-loop

tofarr changed the title ~~Disconnected runtime now stops the agent loop~~ Page Refresh now restarts agent loop if status is STOPPED or ERROR Feb 23, 2025

tofarr requested review from rbren and enyst February 23, 2025 12:06

tofarr marked this pull request as ready for review February 23, 2025 12:08

enyst reviewed Feb 24, 2025

enyst approved these changes Feb 24, 2025

Merge branch 'main' into fix-disconnected-runtime-stop-agent-loop

f2d2048
