[Bug]: (eval) Instance results with llm proxy OpenAIException errors got merged into output.jsonl
#4166
Comments
@ryanhoangt Can you please post a traceback from the logs, if you have one by any chance, or the .jsonl? I made a quick fix in the linked PR; I'd like to look at it some more, though.
Unfortunately, there's no traceback in the trajectory in the jsonl file. There's only one last entry from the agent:

```json
{
  "id": 84,
  "timestamp": "2024-10-02T10:06:45.050451",
  "source": "agent",
  "message": "There was an unexpected error while running the agent",
  "observation": "error",
  "content": "There was an unexpected error while running the agent",
  "extras": {}
}
```

I'm also quite confused about whether it is
The linked PR added retries in our LLM class, but I think a better fix would be to retry the eval, or to make sure the failed instance isn't written to the jsonl so that it will be attempted again.
Thanks for the fix! By the way, can you explain why retrying the whole eval is better? I'm not sure about the architectural side, but imo it may not be necessary to run again from the first step (especially when we're at the very end of the trajectory).
Oh, they're not exclusive. The request is retried now, and we can configure the retry settings to make more attempts. But there will be a limit, so my thinking here is simply that if the proxy continues to be unavailable at that point, the reasonable thing is to give up, and just not save the instance in the jsonl so we can rerun it. 🤔
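As an illustration, here's a minimal sketch of a bounded per-request retry, assuming a tenacity-style setup; the exception and client names are placeholders rather than the actual OpenHands LLM class:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class ProxyUnavailableError(Exception):
    """Placeholder for the 502-style error raised when the LLM proxy is down."""


@retry(
    retry=retry_if_exception_type(ProxyUnavailableError),
    stop=stop_after_attempt(5),                   # the retry limit discussed above
    wait=wait_exponential(multiplier=1, max=60),  # back off between attempts
    reraise=True,                                 # after the last attempt, surface the error to the caller
)
def call_llm(client, prompt: str) -> str:
    # client.complete is a hypothetical method standing in for the real completion call.
    return client.complete(prompt)
```

Once the retries are exhausted and the error propagates, the harness can decide not to write that instance to the jsonl, so a later run picks it up again.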
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
I think I saw another error merged into the jsonl, but... that was when it was only 1 task and 1 worker. We usually use multiprocessing lately, which might be why we don't see it. Maybe. On the other hand, we have meanwhile made more fixes and added some retries when inference ends abnormally, before it gets to the output file, so maybe it was fixed.
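A rough sketch of what that kind of instance-level retry could look like; `run_instance_with_retry`, `process_fn`, and the field names are hypothetical, not the actual harness code:

```python
import json


def run_instance_with_retry(instance: dict, process_fn, max_attempts: int = 3) -> dict:
    """Re-run a single eval instance if it ends abnormally, before anything is
    written to output.jsonl. `process_fn` is a hypothetical per-instance runner
    that is assumed to raise on abnormal termination."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return process_fn(instance)
        except Exception as err:  # ideally narrowed to proxy/connection errors
            last_error = err
    # Give up without writing a result, so the instance is picked up again on a rerun.
    raise RuntimeError(
        f"instance {instance.get('instance_id')} failed after {max_attempts} attempts"
    ) from last_error


def append_result(result: dict, output_path: str = "output.jsonl") -> None:
    # Only called for instances that completed normally.
    with open(output_path, "a") as f:
        f.write(json.dumps(result) + "\n")
```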
Yeah, from my side I can see the retries happening after your fix. Recently, with the new LLM proxy, I don't even receive 502 errors anymore. Maybe this PR can be closed.
Is there an existing issue for the same bug?
Describe the bug
When running the eval via the All Hands AI LLM proxy, the server sometimes crashes with a 502 response. The eval result is still collected into the `output.jsonl` file, with the `error` field containing the OpenAIException from the proxy. We then have to manually filter out instances with that error and rerun them. Maybe we should have some kind of logic to automatically retry in this scenario.
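For reference, a minimal sketch of the manual filtering step, assuming one JSON object per line and that errored instances can be recognized by a non-empty `error` field (the field and file names are assumptions, not the actual eval tooling):

```python
import json


def filter_errored_instances(path: str = "output.jsonl",
                             clean_path: str = "output.clean.jsonl") -> None:
    """Copy output.jsonl, dropping any instance whose `error` field is set,
    so those instances are re-attempted on the next eval run."""
    kept = dropped = 0
    with open(path) as src, open(clean_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            entry = json.loads(line)
            if entry.get("error"):
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    print(f"kept {kept} instances, dropped {dropped} errored ones")


if __name__ == "__main__":
    filter_errored_instances()
```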
Current OpenHands version
Installation and Configuration
Model and Agent
Operating System
Linux
Reproduction Steps
No response
Logs, Errors, Screenshots, and Additional Context
No response