Add eval workflow that triggers remote eval job #5108
Merged
4 commits
@@ -0,0 +1,53 @@
# Run evaluation on a PR
name: Run Eval

# Runs when a PR is labeled with one of the "run-eval-" labels
on:
  pull_request:
    types: [labeled]

jobs:
  trigger-job:
    name: Trigger remote eval job
    if: ${{ github.event.label.name == 'run-eval-xs' || github.event.label.name == 'run-eval-s' || github.event.label.name == 'run-eval-m' }}
    runs-on: ubuntu-latest

    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v3
        with:
          ref: ${{ github.head_ref }}

      - name: Trigger remote job
        run: |
          REPO_URL="https://github.com/${{ github.repository }}"
          PR_BRANCH="${{ github.head_ref }}"
          echo "Repository URL: $REPO_URL"
          echo "PR Branch: $PR_BRANCH"
          if [[ "${{ github.event.label.name }}" == "run-eval-xs" ]]; then
            EVAL_INSTANCES="1"
          elif [[ "${{ github.event.label.name }}" == "run-eval-s" ]]; then
            EVAL_INSTANCES="5"
          elif [[ "${{ github.event.label.name }}" == "run-eval-m" ]]; then
            EVAL_INSTANCES="30"
          fi
          curl -X POST \
            -H "Authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
            -H "Accept: application/vnd.github+json" \
            -d "{\"ref\": \"main\", \"inputs\": {\"github-repo\": \"${REPO_URL}\", \"github-branch\": \"${PR_BRANCH}\", \"pr-number\": \"${{ github.event.pull_request.number }}\", \"eval-instances\": \"${EVAL_INSTANCES}\"}}" \
            https://api.github.com/repos/All-Hands-AI/evaluation/actions/workflows/create-branch.yml/dispatches
          # Send Slack message
          PR_URL="https://github.com/${{ github.repository }}/pull/${{ github.event.pull_request.number }}"
          slack_text="PR $PR_URL has triggered evaluation on $EVAL_INSTANCES instances..."
          curl -X POST -H 'Content-type: application/json' --data '{"text":"'"$slack_text"'"}' \
            https://hooks.slack.com/services/${{ secrets.SLACK_TOKEN }}
      - name: Comment on PR
        uses: KeisukeYamashita/create-comment@v1
        with:
          unique: false
          comment: |
            Running evaluation on the PR. Once eval is done, the results will be posted.
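
The label-to-instance mapping and the dispatch payload in the `Trigger remote job` step could be factored into small shell helpers. The following is a minimal sketch, not the code merged in this PR: the function names `eval_instances_for_label` and `build_dispatch_payload` are hypothetical, the `OWNER/REPO`, branch, and PR number in the usage lines are placeholders, and the payload builder assumes its arguments contain no double quotes or backslashes, since it does no JSON escaping.

```shell
# Hypothetical helper: map a "run-eval-*" label to the number of eval instances.
# Unlike an if/elif chain without an else branch, an unknown label fails loudly.
eval_instances_for_label() {
  case "$1" in
    run-eval-xs) echo 1 ;;
    run-eval-s)  echo 5 ;;
    run-eval-m)  echo 30 ;;
    *) echo "unknown label: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the JSON body for the workflow dispatch request.
# Assumption: no argument contains a double quote or backslash (no escaping is done).
build_dispatch_payload() {
  local repo_url="$1" branch="$2" pr_number="$3" instances="$4"
  printf '{"ref": "main", "inputs": {"github-repo": "%s", "github-branch": "%s", "pr-number": "%s", "eval-instances": "%s"}}' \
    "$repo_url" "$branch" "$pr_number" "$instances"
}

# Usage with placeholder values.
instances="$(eval_instances_for_label "run-eval-s")" || exit 1
build_dispatch_payload "https://github.com/OWNER/REPO" "my-branch" "123" "$instances"
echo
```

Keeping the mapping in one `case` statement means a newly added label only needs one new line, and a typo in a label name stops the step instead of sending an empty `eval-instances` value.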
Does the normal `uses:` syntax not work here? E.g. I have this in another repo to trigger actions across repos.
It's because the remote workflow is not a reusable workflow. It can be used standalone as well.
Can you please explain a bit more, why would this workflow not work in this repo?
I mean that the remote workflow is not a reusable workflow, so you can't call it like that.
The remote workflow is a workflow that can be triggered manually.
Or did you mean why can we not bring the remote workflow into this repo?
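
To make the distinction concrete: a reusable workflow declares `on: workflow_call` and is invoked with `uses:`, while a manually triggerable workflow declares `on: workflow_dispatch` and is started through the dispatch API or the `gh` CLI. Below is a minimal sketch of the latter, assuming the `gh` CLI is installed and authenticated with a token allowed to dispatch workflows in the target repo. The function name and the `DRY_RUN` switch are hypothetical; the workflow file, repo, and input names are taken from the `curl` call in this PR.

```shell
# Hypothetical helper: start the remote eval workflow via `gh workflow run`.
# With DRY_RUN=1 (the default in this sketch) it only prints the command.
trigger_eval_workflow() {
  local repo_url="$1" branch="$2" pr_number="$3" instances="$4"
  # Rebuild the positional parameters as the command to run.
  set -- gh workflow run create-branch.yml \
    --repo All-Hands-AI/evaluation \
    --ref main \
    -f "github-repo=$repo_url" \
    -f "github-branch=$branch" \
    -f "pr-number=$pr_number" \
    -f "eval-instances=$instances"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage with placeholder values; prints the command instead of running it.
trigger_eval_workflow "https://github.com/OWNER/REPO" "my-branch" "123" "5"
```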
I'm totally confused, sorry if I confused you too. What I mean is just this:
Maybe I just assumed wrong, and this PR was never intended for that eval? 😅
Supporting tools doesn't mean it is good at it, e.g. see Slack discussion.
I'm not arguing against running an eval, but against using Haiku as it is not a good model for programming.
For now the default is still sonnet-3-5.
Tobi, I think you're right as far as Haiku goes. I also think we talk about different things? I'm not saying that Haiku would replace Sonnet 3.5 evals when Sonnet is needed. That's not my argument and not what adding Haiku here implies, afaict.
I look at it more as an addition: when we don't need Sonnet but want to see something.
If the examples above are not good, what do you think about another: a nightly run. If Haiku does, let's say, 19% usually, then seeing a nightly with 2% is a signal that something went wrong. That's all. A signal is all we need to double check on a real eval, like on Sonnet - or whatever we have to do.
Haiku doesn't need to be good to serve as a signal this way? It just needs to be better than a coin flip. 🤔 😅
Just my opinion anyway. I will be even happier when we find/decide/start evals on an open LLM.
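
The nightly-signal idea can be sketched as a simple threshold check. This is only an illustration: the function name and the `max_drop` tolerance are hypothetical, and the 19% and 2% figures are the made-up numbers from the comment above, not measured results.

```shell
# Hypothetical check: succeed (exit 0) when the nightly score fell more than
# max_drop percentage points below the usual baseline. Integer percentages only.
needs_full_eval() {
  local baseline="$1" nightly="$2" max_drop="$3"
  [ $((baseline - nightly)) -gt "$max_drop" ]
}

# Baseline 19%, nightly 2%, tolerance 5 points: the 17-point drop exceeds it.
if needs_full_eval 19 2 5; then
  echo "nightly dropped sharply: re-run the eval on the default model"
fi
```

The cheap model's absolute score does not matter here; only the gap between its usual score and the nightly score is used as the trigger.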
We definitely need a "cheap" eval option, which can be used as a proxy for evals on Sonnet. We can do a lot more ad hoc testing if we use cheap models, which would be cost-prohibitive with Sonnet.
Of course, this carries the assumption that results will correlate between models. They mostly do, but we've all seen examples where the correlation breaks down. So we will always need to run a final eval on Sonnet to ensure there aren't any regressions with our default model.