From 02765f0eaf70a4f942c0dfab9348c6432e034658 Mon Sep 17 00:00:00 2001
From: Tim Li
Date: Mon, 20 May 2024 17:28:45 -0700
Subject: [PATCH 1/2] rename to Arena Hard Auto

---
 blog/2024-04-19-arena-hard.md              |  40 +++++++++---------
 .../arena_hard/arena-hard-vs-mt_bench.png  | Bin 403733 -> 406545 bytes
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/blog/2024-04-19-arena-hard.md b/blog/2024-04-19-arena-hard.md
index 6b7f3ed0..325f5bdf 100644
--- a/blog/2024-04-19-arena-hard.md
+++ b/blog/2024-04-19-arena-hard.md
@@ -13,7 +13,7 @@ We introduce Arena-Hard – a data pipeline to build high-quality benchmarks fro
 1. Agreement to Human preference: whether the benchmark score has high agreement to human preference.
 2. Separability: whether the benchmark can confidently separate models.
 
-We compare our new benchmark, Arena Hard v0.1, to a current leading chat LLM benchmark, MT Bench. In Figure 1, we show Arena Hard v0.1 offers significantly stronger separability against MT Bench with tighter confidence intervals. It also has a higher agreement (89.1%, see Table 1) with the human preference ranking by Chatbot Arena (english-only). We expect to see this benchmark useful for model developers to differentiate their model checkpoints.
+We compare our new benchmark, Arena Hard Auto v0.1, to a current leading chat LLM benchmark, MT Bench. In Figure 1, we show Arena Hard Auto v0.1 offers significantly stronger separability against MT Bench with tighter confidence intervals. It also has a higher agreement (89.1%, see Table 1) with the human preference ranking by Chatbot Arena (english-only). We expect to see this benchmark useful for model developers to differentiate their model checkpoints.