Add eval workflow that triggers remote eval job #5108
I'm working on PR comments for start and finish...
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results: ## Summary
LGTM!
@enyst we will definitely continue enhancing this (e.g. with a link to download the full report)--lmk if you have lingering concerns
Running evaluation on the PR. Once eval is done, the results will be posted.
OK, thank you. Yes, I think we can do this better, but I really appreciate the work on this @mamoodi <3
I started one to see if it works for me. I would suggest merging if it ends successfully.
@enyst we currently have it limited to just me, Graham, Xingyao, and Mahmoud, given the potential to accidentally spend several hundred dollars 😅
Running evaluation on the PR. Once eval is done, the results will be posted.
Maybe we can have tiers here, something like:
I think we definitely need something, because anyone who wants to run stuff in private can always do so. But if this workflow is defined here on this repo, then I do think it should be useful somehow here.
To clarify: I already talked to mamoodi before and we agreed that from my perspective, this PR is mergeable as it is. Not because it's entirely ready, but because we know what we need to think about. 😹 Sorry, I did want to see if it works. 😅 I think Haiku is actually relatively expensive; after all, we changed from deepseek, where the difference is very significant. I'll give some thought to this:
Edited to add:
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This workflow triggers a remote eval job that runs an evaluation of 1 instance for now. This is to get the basic setup working.
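For discussion purposes, here is a rough Python sketch of the gating idea raised in the conversation: only a small allowlist of people can trigger a (potentially expensive) eval run, and the run defaults to 1 instance. This is not the code in this PR; the names `ALLOWED_TRIGGERERS`, `EvalRequest`, and `build_eval_request`, and the placeholder usernames, are all hypothetical, and the actual call to the remote eval service is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical allowlist: the real list of permitted users is not shown in this PR.
ALLOWED_TRIGGERERS = frozenset({"maintainer-a", "maintainer-b"})


@dataclass(frozen=True)
class EvalRequest:
    """Assumed shape of the request that would be sent to the remote eval job."""
    pr_number: int
    requested_by: str
    instances: int


def build_eval_request(pr_number, requested_by, instances=1, allowed=ALLOWED_TRIGGERERS):
    """Validate the requester and return an EvalRequest; nothing is sent anywhere."""
    if requested_by not in allowed:
        # Guard against accidental spend by users outside the allowlist.
        raise PermissionError(f"{requested_by} is not allowed to trigger eval runs")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return EvalRequest(pr_number=pr_number, requested_by=requested_by, instances=instances)


# Example: an allowlisted user requests the default 1-instance run for this PR.
request = build_eval_request(5108, "maintainer-a")
print(request)
```

A tiered setup, as suggested in the thread, could replace the single allowlist with a mapping from user group to a maximum `instances` value; that is left open here since the tiers were not settled.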
Link of any specific issues this addresses
To run this PR locally, use the following command: