
Add gpt-4o-mini to smoke test github workflow and make smoke test judge more reliable #873

Merged 2 commits on Oct 14, 2024
16 changes: 8 additions & 8 deletions .github/workflows/smoke.yaml
@@ -59,7 +59,7 @@ jobs:
 
           echo "run_smoke_tests=false" >> $GITHUB_OUTPUT
 
-  gpt-4o-2024-05-13:
+  gpt-4o-2024-08-06:
     needs: check-label
     if: ${{ needs.check-label.outputs.run_smoke_tests == 'true' }}
     runs-on: ubuntu-22.04
@@ -81,14 +81,14 @@ jobs:
           go-version: "1.21"
       - env:
           OPENAI_API_KEY: ${{ secrets.SMOKE_OPENAI_API_KEY }}
-          GPTSCRIPT_DEFAULT_MODEL: gpt-4o-2024-05-13
-        name: Run smoke test for gpt-4o-2024-05-13
+          GPTSCRIPT_DEFAULT_MODEL: gpt-4o-2024-08-06
+        name: Run smoke test for gpt-4o-2024-08-06
         run: |
-          echo "Running smoke test for model gpt-4o-2024-05-13"
+          echo "Running smoke test for model gpt-4o-2024-08-06"
           export PATH="$(pwd)/bin:${PATH}"
           make smoke
 
-  gpt-4-turbo-2024-04-09:
+  gpt-4o-mini-2024-07-18:
     needs: check-label
     if: ${{ needs.check-label.outputs.run_smoke_tests == 'true' }}
     runs-on: ubuntu-22.04
@@ -110,10 +110,10 @@ jobs:
           go-version: "1.21"
       - env:
           OPENAI_API_KEY: ${{ secrets.SMOKE_OPENAI_API_KEY }}
-          GPTSCRIPT_DEFAULT_MODEL: gpt-4-turbo-2024-04-09
-        name: Run smoke test for gpt-4-turbo-2024-04-09
+          GPTSCRIPT_DEFAULT_MODEL: gpt-4o-mini-2024-07-18
+        name: Run smoke test for gpt-4o-mini-2024-07-18
         run: |
-          echo "Running smoke test for model gpt-4-turbo-2024-04-09"
+          echo "Running smoke test for model gpt-4o-mini-2024-07-18"
           export PATH="$(pwd)/bin:${PATH}"
           make smoke

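The two model jobs in the workflow differ only in the model name. As a hedged sketch (a hypothetical refactor, not part of this PR), a GitHub Actions matrix strategy could express the same pair of smoke-test jobs in one definition:

```yaml
# Hypothetical alternative: a single job parameterized over the two models
# this PR runs. Job and secret names mirror the workflow above.
smoke:
  needs: check-label
  if: ${{ needs.check-label.outputs.run_smoke_tests == 'true' }}
  runs-on: ubuntu-22.04
  strategy:
    matrix:
      model: [gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18]
  steps:
    # (checkout, Go setup, and build steps as in the workflow above)
    - name: Run smoke test for ${{ matrix.model }}
      env:
        OPENAI_API_KEY: ${{ secrets.SMOKE_OPENAI_API_KEY }}
        GPTSCRIPT_DEFAULT_MODEL: ${{ matrix.model }}
      run: |
        echo "Running smoke test for model ${{ matrix.model }}"
        export PATH="$(pwd)/bin:${PATH}"
        make smoke
```

One trade-off of the duplicated-job form the PR keeps: each model shows up as a distinct named check in the PR status list, which a matrix expansion also provides via its generated job names.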
6 changes: 4 additions & 2 deletions pkg/tests/judge/judge.go
@@ -40,6 +40,8 @@ After making a determination, respond with a JSON object that conforms to the fo
   ]
 }
 
+If you determine actual and expected are not equivalent, include a diff of the parts of actual and expected that are not equivalent in the reasoning field of your response.
+
 Your responses are concise and include only the json object described above.
 `

@@ -84,10 +86,10 @@ func New[T any](client *openai.Client) (*Judge[T], error) {
 }
 
 func (j *Judge[T]) Equal(ctx context.Context, expected, actual T, criteria string) (equal bool, reasoning string, err error) {
-	comparisonJSON, err := json.MarshalIndent(&comparison[T]{
+	comparisonJSON, err := json.Marshal(&comparison[T]{
 		Expected: expected,
 		Actual:   actual,
-	}, "", " ")
+	})
 	if err != nil {
 		return false, "", fmt.Errorf("failed to marshal judge testcase JSON: %w", err)
 	}
9 changes: 7 additions & 2 deletions pkg/tests/smoke/smoke_test.go
@@ -82,8 +82,8 @@ func TestSmoke(t *testing.T) {
 			expectedEvents,
 			actualEvents,
 			`
-- disregard differences in timestamps, generated IDs, natural language verbiage, and event order
-- omit callProgress events from the comparison
+- disregard differences in event order, timestamps, generated IDs, and natural language verbiage, grammar, and punctuation
+- compare events with matching event types
 - the overall stream of events and set of tools called should roughly match
 - arguments passed in tool calls should be roughly the same
 - the final callFinish event should be semantically similar
@@ -175,6 +175,11 @@ func getActualEvents(t *testing.T, eventsFile string) []event {
 
 		var e event
 		require.NoError(t, json.Unmarshal([]byte(line), &e))
+
+		if e.Type == runner.EventTypeCallProgress {
+			continue
+		}
+
 		events = append(events, e)
 	}
 