Add eval workflow that triggers remote eval job #5108

Merged (4 commits, Nov 22, 2024)
53 changes: 53 additions & 0 deletions .github/workflows/run-eval.yml
@@ -0,0 +1,53 @@
# Run evaluation on a PR
name: Run Eval

# Runs when a PR is labeled with one of the "run-eval-" labels
on:
  pull_request:
    types: [labeled]

jobs:
  trigger-job:
    name: Trigger remote eval job
    if: ${{ github.event.label.name == 'run-eval-xs' || github.event.label.name == 'run-eval-s' || github.event.label.name == 'run-eval-m' }}
    runs-on: ubuntu-latest

    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v3
        with:
          ref: ${{ github.head_ref }}

      - name: Trigger remote job
        run: |
Collaborator:
Does the normal uses: syntax not work here? E.g. I have this in another repo to trigger actions across repos:

uses: all-hands-ai/deploy/.github/workflows/_docker_push.yaml@main
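(For reference, a uses: line like that sits at the job level and only works if the target workflow declares an on: workflow_call trigger. A rough sketch of the job-level invocation; the job name and input are invented for illustration:)

jobs:
  docker-push:
    # Calling a reusable workflow that lives in another repo. This requires
    # the called workflow to declare an "on: workflow_call" trigger.
    uses: all-hands-ai/deploy/.github/workflows/_docker_push.yaml@main
    with:
      image-tag: latest   # illustrative input; real inputs depend on the called workflow
    secrets: inherit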

Collaborator Author:
It's because the remote workflow is not a reusable workflow. It can also be run standalone.

Collaborator:
Can you please explain a bit more: why would this workflow not work in this repo?

Collaborator Author:
I mean that the remote workflow is not a reusable workflow, so you can't call it like that: it is a workflow that is triggered manually.

Or did you mean: why can we not bring the remote workflow into this repo?
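(For context: the remote workflow, create-branch.yml in All-Hands-AI/evaluation, exposes a manual workflow_dispatch trigger rather than workflow_call, which is why the run step's curl posts to its /dispatches endpoint instead of using uses:. A rough sketch of what its trigger section presumably looks like; the input names are taken from the JSON payload this PR sends, everything else is an assumption:)

# Hypothetical sketch of the remote workflow's trigger section
# (All-Hands-AI/evaluation/.github/workflows/create-branch.yml).
# Input names mirror the payload sent by this PR; descriptions and types are assumptions.
on:
  workflow_dispatch:
    inputs:
      github-repo:
        description: "URL of the repository to evaluate"
        required: true
        type: string
      github-branch:
        description: "Branch to evaluate"
        required: true
        type: string
      pr-number:
        description: "PR number that triggered the eval"
        required: true
        type: string
      eval-instances:
        description: "Number of eval instances to run"
        required: true
        type: string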

Collaborator:
I'm totally confused, sorry if I'm confusing you too. What I mean is just this:

Maybe I just assumed wrong, and this PR was never intended for that eval? 😅

Collaborator:
Supporting tools doesn't mean it is good at using them; see e.g. the Slack discussion.
I'm not arguing against running an eval, but against using Haiku, as it is not a good model for programming.

Collaborator Author:
For now the default is still sonnet-3-5.

Collaborator:
Tobi, I think you're right as far as Haiku goes. I also think we're talking about different things? I'm not saying that Haiku would replace Sonnet 3.5 evals when Sonnet is needed. That's not my argument and not what adding Haiku here implies, afaict.

I look at it more as an addition: for when we don't need Sonnet but want to see something.

If the examples above are not good, what do you think about another one: a nightly run. If Haiku usually scores, let's say, 19%, then seeing a nightly run at 2% is a signal that something went wrong. That's all. A signal is all we need to double-check with a real eval, like on Sonnet - or whatever we have to do.

Haiku doesn't need to be good to serve as a signal this way; it just needs to be better than a coin flip. 🤔 😅

Just my opinion anyway. I will be even happier when we find/decide/start evals on an open LLM.

Collaborator:
  • Who decides, and by which criteria, whether model A or B is "needed"?
  • If cost is an issue, gpt4o is still cheaper than Sonnet, but more capable than Haiku.
  • I don't understand the signal argument: what if the changes to the codebase (e.g. from a PR) work fine, or even better than previous runs, with Sonnet, but a smaller model (like Haiku) has more problems with them and causes the result to drop? If that's possible, then the "signal" could lead to wrong conclusions about the PR. Tl;dr: what if the PR changes have nothing to do with the "bad signal", and it's just the LLM not being good enough?
  • IMO using an inferior model only adds potential for extra errors or unpredictable results.

Collaborator:
We definitely need a "cheap" eval option, which can be used as a proxy for evals on Sonnet. We can do a lot more ad hoc testing if we use cheap models, which would be cost-prohibitive with Sonnet.

Of course, this carries the assumption that results will correlate between models. And they mostly do, but ofc we've all seen examples where the correlation breaks down. So we will always need to run a final eval on Sonnet to ensure there aren't any regressions with our default model.

          REPO_URL="https://github.com/${{ github.repository }}"
          PR_BRANCH="${{ github.head_ref }}"
          echo "Repository URL: $REPO_URL"
          echo "PR Branch: $PR_BRANCH"
          if [[ "${{ github.event.label.name }}" == "run-eval-xs" ]]; then
            EVAL_INSTANCES="1"
          elif [[ "${{ github.event.label.name }}" == "run-eval-s" ]]; then
            EVAL_INSTANCES="5"
          elif [[ "${{ github.event.label.name }}" == "run-eval-m" ]]; then
            EVAL_INSTANCES="30"
          fi
          curl -X POST \
            -H "Authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
            -H "Accept: application/vnd.github+json" \
            -d "{\"ref\": \"main\", \"inputs\": {\"github-repo\": \"${REPO_URL}\", \"github-branch\": \"${PR_BRANCH}\", \"pr-number\": \"${{ github.event.pull_request.number }}\", \"eval-instances\": \"${EVAL_INSTANCES}\"}}" \
            https://api.github.com/repos/All-Hands-AI/evaluation/actions/workflows/create-branch.yml/dispatches
          # Send Slack message
          PR_URL="https://github.com/${{ github.repository }}/pull/${{ github.event.pull_request.number }}"
          slack_text="PR $PR_URL has triggered evaluation on $EVAL_INSTANCES instances..."
          curl -X POST -H 'Content-type: application/json' --data '{"text":"'"$slack_text"'"}' \
            https://hooks.slack.com/services/${{ secrets.SLACK_TOKEN }}

      - name: Comment on PR
        uses: KeisukeYamashita/create-comment@v1
        with:
          unique: false
          comment: |
            Running evaluation on the PR. Once eval is done, the results will be posted.
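As a usage note, roughly the same dispatch can be reproduced from a terminal with the GitHub CLI, which is handy for testing the remote workflow without labeling a PR. This is only a sketch: it assumes a gh login with access to All-Hands-AI/evaluation, and the repo URL and branch values below are placeholders.

# Hypothetical manual dispatch of the remote eval workflow via the GitHub CLI.
gh workflow run create-branch.yml \
  --repo All-Hands-AI/evaluation \
  --ref main \
  -f github-repo="https://github.com/<owner>/<repo>" \
  -f github-branch="<pr-branch>" \
  -f pr-number="5108" \
  -f eval-instances="5"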