
Add eval workflow that triggers remote eval job #5108

Merged: 4 commits into main from mh/test-eval-wf, Nov 22, 2024
Conversation

@mamoodi (Collaborator) commented Nov 18, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This workflow triggers a remote eval job that runs an evaluation of 1 instance for now. This is to get the basic setup working.
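A label-triggered workflow of this kind might be sketched roughly as follows. This is an illustrative sketch, not the actual workflow file from this PR: the job name, the secrets (`EVAL_TRIGGER_URL`, `EVAL_TOKEN`), and the remote endpoint's payload are all assumptions; only the `run-eval-xs` label name comes from this thread.

```yaml
# Hypothetical sketch of a label-triggered eval workflow.
# Secrets and the remote endpoint payload are illustrative, not from this PR.
name: Run Eval

on:
  pull_request:
    types: [labeled]

jobs:
  trigger-remote-eval:
    # Only fire when the run-eval-xs label is applied
    if: github.event.label.name == 'run-eval-xs'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger remote eval job (1 instance)
        env:
          EVAL_TRIGGER_URL: ${{ secrets.EVAL_TRIGGER_URL }}  # hypothetical secret
          EVAL_TOKEN: ${{ secrets.EVAL_TOKEN }}              # hypothetical secret
        run: |
          curl -sf -X POST "$EVAL_TRIGGER_URL" \
            -H "Authorization: Bearer $EVAL_TOKEN" \
            -d '{"pr": ${{ github.event.pull_request.number }}, "instances": 1}'
```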


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2e8440-nikolaik \
  --name openhands-app-b2e8440 \
  docker.all-hands.dev/all-hands-ai/openhands:b2e8440

@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 18, 2024
@amanape amanape added the run-eval-xs Runs evaluation with 1 instance label Nov 19, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 19, 2024
@mamoodi mamoodi changed the title Test eval workflow Add eval workflow that triggers remote eval job Nov 19, 2024
@mamoodi mamoodi marked this pull request as ready for review November 19, 2024 17:40
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 19, 2024
@mamoodi mamoodi added run-eval-s Runs evaluation with 5 instances and removed run-eval-xs Runs evaluation with 1 instance labels Nov 21, 2024
@mamoodi (Collaborator, Author) commented Nov 21, 2024

I'm working on PR comments for start and finish....

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Nov 21, 2024
@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Nov 22, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-s Runs evaluation with 5 instances labels Nov 22, 2024
@All-Hands-AI All-Hands-AI deleted a comment from github-actions bot Nov 22, 2024
@mamoodi mamoodi added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@openhands-agent (Contributor) commented:

Evaluation results summary:

  • submitted instances: 1
  • empty patch instances: 0
  • resolved instances: 1
  • unresolved instances: 0
  • error instances: 0

@mamoodi mamoodi requested a review from enyst November 22, 2024 15:51
@rbren (Collaborator) left a comment

LGTM!

@enyst we will definitely continue enhancing this (e.g. with a link to download the full report). Let me know if you have lingering concerns.

@enyst enyst added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@enyst (Collaborator) left a comment
OK, thank you. Yes, I think we can do this better, but I really appreciate the work on this @mamoodi <3

I started one to see if it works for me. I would suggest merging if it ends successfully.

@rbren (Collaborator) commented Nov 22, 2024

@enyst we currently have it limited to just me, Graham, Xingyao, and Mahmoud, given the potential to accidentally spend several hundred dollars 😅

@rbren rbren added run-eval-xs Runs evaluation with 1 instance and removed run-eval-xs Runs evaluation with 1 instance labels Nov 22, 2024
Contributor

Running evaluation on the PR. Once eval is done, the results will be posted.

@rbren (Collaborator) commented Nov 22, 2024

Maybe we can have tiers here, something like:

  • anyone can run an xs
  • maintainers can run s or m
  • admins can run large
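Such a tier policy could be expressed as a simple mapping from permission level to allowed labels. This is only a sketch of the proposal above, not anything implemented in this PR: the `run-eval-m` and `run-eval-l` label names and the `read`/`write`/`admin` permission names are assumptions (only `run-eval-xs` and `run-eval-s` appear in this thread).

```python
# Hypothetical mapping of permission tiers to the eval sizes they may trigger.
# "read"/"write"/"admin" loosely mirror GitHub collaborator permission levels;
# the run-eval-m and run-eval-l labels are assumed, not confirmed by this PR.
ALLOWED_EVALS = {
    "read": {"run-eval-xs"},
    "write": {"run-eval-xs", "run-eval-s", "run-eval-m"},
    "admin": {"run-eval-xs", "run-eval-s", "run-eval-m", "run-eval-l"},
}

def may_trigger(permission: str, label: str) -> bool:
    """Return True if a user with `permission` may trigger the eval `label`."""
    return label in ALLOWED_EVALS.get(permission, set())

if __name__ == "__main__":
    print(may_trigger("read", "run-eval-xs"))   # anyone can run an xs
    print(may_trigger("write", "run-eval-m"))   # maintainers can run s or m
    print(may_trigger("write", "run-eval-l"))   # but not large
```

In a real workflow, `permission` would come from an API lookup of the user who applied the label, and the check would gate the remote trigger step.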

@enyst (Collaborator) commented Nov 22, 2024

I think we definitely need something, because anyone who wants to run evals in private can always do that anyway. But if this workflow is defined here on this repo, then I do think it should somehow be useful here.

@enyst (Collaborator) commented Nov 22, 2024

To clarify: I already talked to mamoodi before, and we agreed that, from my perspective, this PR is mergeable as it is. Not because it's entirely ready, but because we know what we need to think about. 😹

Sorry, I did want to see if it works. 😅

I think Haiku is actually relatively expensive; after all, we changed from deepseek, where the cost difference is very significant.

I'll give some thought to this:

  • anyone can run an xs
  • maintainers can run s or m
  • admins can run large

Edited to add:
I'll go ahead and merge, because it's just a workflow... But I think we kinda should follow up soon with some decision IMHO. Because I think this is unique: a set of labels (run-eval-) that are useless to maintainers/collaborators who can set them. It feels bugged. 😀

@enyst enyst merged commit 36e3dc5 into main Nov 22, 2024
32 checks passed
@enyst enyst deleted the mh/test-eval-wf branch November 22, 2024 18:24
Labels: run-eval-xs (Runs evaluation with 1 instance)

6 participants