Merge pull request #68 from lm-sys/arena-hard
fixes
CodingWithTim committed Apr 21, 2024
2 parents 2541f71 + 6f57b0b commit 48852be
Showing 2 changed files with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion blog/2024-04-19-arena-hard.md
Original file line number Diff line number Diff line change
Expand Up @@ -286,7 +286,7 @@ To avoid potential position bias, we adopt a two-game setup – per query we swa

We use gpt-4-1106-preview as the judge model to generate judgment for the model response against baseline. We take all the comparisons and compute each model’s Bradley-Terry coefficient. We then transform it to win-rate against the baseline as the final score. The 95% confidence interval is computed via 100 rounds of bootstrapping.

- <p style="color:gray; text-align: center;">Arena Hard v0.1 Leaderboard</p>
+ <p style="color:gray; text-align: center;">Arena Hard v0.1 Leaderboard (baseline: GPT-4-0314)</p>
<div style="display: flex; justify-content: center; font-family: Consolas, monospace;">
<table style="line-height: 1; font-size: 1.0em;">
<caption style="text-align: left; color: red">*Note: GPT-4-Turbo’s high score can be due to the GPT-4 judge favoring GPT-4 outputs.</caption>
Binary file modified public/images/blog/arena_hard/arena-hard-vs-mt_bench.png
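
Note on the scoring paragraph in the hunk above: the pipeline it describes (fit Bradley-Terry coefficients, convert to win-rate against the baseline, bootstrap a 95% interval over 100 rounds) can be sketched roughly as below. This is a minimal illustration, not the benchmark's actual implementation: the function names, the `(winner, loser)` battle format, and the iterative minorization-maximization fit are all assumptions made for the example, and ties or weighted judgments are ignored.

```python
import random
from collections import defaultdict


def fit_bradley_terry(battles, models=None, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper: uses the classic iterative (MM) update, which is
    only one of several ways to fit the model.
    """
    models = sorted(models or {m for pair in battles for m in pair})
    wins = defaultdict(float)
    games = defaultdict(float)  # games played per unordered pair
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = games.get(frozenset((i, j)), 0.0)
                if j != i and n_ij:
                    denom += n_ij / max(strength[i] + strength[j], 1e-12)
            # A model with no games in this sample keeps its old strength.
            updated[i] = wins[i] / denom if denom else strength[i]
        # Strengths are only defined up to scale, so renormalize each round.
        scale = len(models) / max(sum(updated.values()), 1e-12)
        strength = {m: v * scale for m, v in updated.items()}
    return strength


def win_rates_vs_baseline(strength, baseline):
    """Convert strengths to predicted win-rate against the baseline."""
    p_base = strength[baseline]
    rates = {}
    for model, p in strength.items():
        total = p + p_base
        rates[model] = p / total if total > 0 else 0.5
    return rates


def bootstrap_ci(battles, model, baseline, rounds=100, seed=0):
    """Percentile 95% interval from resampling battles with replacement."""
    rng = random.Random(seed)
    models = {m for pair in battles for m in pair}
    samples = []
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))
        strength = fit_bradley_terry(resampled, models=models)
        samples.append(win_rates_vs_baseline(strength, baseline)[model])
    samples.sort()
    n = len(samples)
    return samples[int(0.025 * n)], samples[min(n - 1, int(0.975 * n))]


if __name__ == "__main__":
    # Toy data: "model-a" beats "baseline" in 7 of 10 judged comparisons.
    battles = [("model-a", "baseline")] * 7 + [("baseline", "model-a")] * 3
    strength = fit_bradley_terry(battles)
    print(win_rates_vs_baseline(strength, "baseline"))
    print(bootstrap_ci(battles, "model-a", "baseline"))
```

By construction the baseline scores 0.5 against itself, which may be why the caption change in this commit names the baseline explicitly: the leaderboard numbers are only meaningful relative to it.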
