
Add eval workflow that triggers remote eval job #5108

Merged: 4 commits into main from mh/test-eval-wf, Nov 22, 2024
Conversation

@mamoodi (Collaborator) commented Nov 18, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This workflow triggers a remote eval job that runs an evaluation of 1 instance for now. This is to get the basic setup working.
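A label-triggered workflow of this kind might be sketched roughly as follows. This is an illustrative sketch, not the actual workflow file from this PR: the job name, the secrets (`EVAL_TRIGGER_URL`, `EVAL_TOKEN`), and the remote endpoint's payload are all assumptions; only the `run-eval-xs` label name comes from this thread.

```yaml
# Hypothetical sketch of a label-triggered eval workflow.
# Secrets and the remote endpoint payload are illustrative, not from this PR.
name: Run Eval

on:
  pull_request:
    types: [labeled]

jobs:
  trigger-remote-eval:
    # Only fire when the run-eval-xs label is applied
    if: github.event.label.name == 'run-eval-xs'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger remote eval job (1 instance)
        env:
          EVAL_TRIGGER_URL: ${{ secrets.EVAL_TRIGGER_URL }}  # hypothetical secret
          EVAL_TOKEN: ${{ secrets.EVAL_TOKEN }}              # hypothetical secret
        run: |
          curl -sf -X POST "$EVAL_TRIGGER_URL" \
            -H "Authorization: Bearer $EVAL_TOKEN" \
            -d '{"pr": ${{ github.event.pull_request.number }}, "instances": 1}'
```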


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2e8440-nikolaik \
  --name openhands-app-b2e8440 \
  docker.all-hands.dev/all-hands-ai/openhands:b2e8440

@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 18, 2024
@amanape amanape added the run-eval-xs Runs evaluation with 1 instance label Nov 19, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 19, 2024
@mamoodi mamoodi changed the title Test eval workflow Add eval workflow that triggers remote eval job Nov 19, 2024
@mamoodi mamoodi marked this pull request as ready for review November 19, 2024 17:40
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 19, 2024
@mamoodi mamoodi added run-eval-s Runs evaluation with 5 instances and removed run-eval-xs Runs evaluation with 1 instance labels Nov 21, 2024
@mamoodi (Collaborator, Author) commented Nov 21, 2024

I'm working on PR comments for start and finish....

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Nov 21, 2024
@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Nov 22, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-s Runs evaluation with 5 instances labels Nov 22, 2024
@All-Hands-AI All-Hands-AI deleted a comment from github-actions bot Nov 22, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@openhands-agent (Contributor) commented:

Evaluation results summary:

  • submitted instances: 1
  • empty patch instances: 0
  • resolved instances: 1
  • unresolved instances: 0
  • error instances: 0

@mamoodi mamoodi requested a review from enyst November 22, 2024 15:51
@rbren (Collaborator) left a comment

LGTM!

@enyst we will definitely continue enhancing this (e.g. with a link to download the full report). Let me know if you have lingering concerns.

@enyst enyst added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@enyst (Collaborator) left a comment
OK, thank you. Yes, I think we can do this better, but I really appreciate the work on this @mamoodi <3

I started one to see if it works for me. I would suggest merging if it ends successfully.

@rbren (Collaborator) commented Nov 22, 2024

@enyst we currently have it limited to just me, Graham, Xingyao, and Mahmoud, given the potential to accidentally spend several hundred dollars 😅

@rbren rbren added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@rbren (Collaborator) commented Nov 22, 2024

Maybe we can have tiers here, something like:

  • anyone can run an xs
  • maintainers can run s or m
  • admins can run large
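Such a tier policy could be expressed as a simple mapping from permission level to allowed labels. This is only a sketch of the proposal above, not anything implemented in this PR: the `run-eval-m` and `run-eval-l` label names and the `read`/`write`/`admin` permission names are assumptions (only `run-eval-xs` and `run-eval-s` appear in this thread).

```python
# Hypothetical mapping of permission tiers to the eval sizes they may trigger.
# "read"/"write"/"admin" loosely mirror GitHub collaborator permission levels;
# the run-eval-m and run-eval-l labels are assumed, not confirmed by this PR.
ALLOWED_EVALS = {
    "read": {"run-eval-xs"},
    "write": {"run-eval-xs", "run-eval-s", "run-eval-m"},
    "admin": {"run-eval-xs", "run-eval-s", "run-eval-m", "run-eval-l"},
}

def may_trigger(permission: str, label: str) -> bool:
    """Return True if a user with `permission` may trigger the eval `label`."""
    return label in ALLOWED_EVALS.get(permission, set())

if __name__ == "__main__":
    print(may_trigger("read", "run-eval-xs"))   # anyone can run an xs
    print(may_trigger("write", "run-eval-m"))   # maintainers can run s or m
    print(may_trigger("write", "run-eval-l"))   # but not large
```

In a real workflow, `permission` would come from an API lookup of the user who applied the label, and the check would gate the remote trigger step.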

@enyst (Collaborator) commented Nov 22, 2024

I think we definitely need something, because anyone who wants to run evals in private can always do that anyway. But if this workflow is defined here on this repo, then I do think it should somehow be useful here.

@enyst (Collaborator) commented Nov 22, 2024

To clarify: I already talked to mamoodi before, and we agreed that, from my perspective, this PR is mergeable as it is. Not because it's entirely ready, but because we know what we need to think about. 😹

Sorry, I did want to see if it works. 😅

I think Haiku is actually relatively expensive; after all, we changed from deepseek, where the cost difference is very significant.

I'll give some thought to this:

  • anyone can run an xs
  • maintainers can run s or m
  • admins can run large

Edited to add:
I'll go ahead and merge, because it's just a workflow... But I think we kinda should follow up soon with some decision IMHO. Because I think this is unique: a set of labels (run-eval-) that are useless to maintainers/collaborators who can set them. It feels bugged. 😀

@enyst enyst merged commit 36e3dc5 into main Nov 22, 2024
32 checks passed
@enyst enyst deleted the mh/test-eval-wf branch November 22, 2024 18:24
Labels: run-eval-xs (Runs evaluation with 1 instance)

6 participants