diff --git a/blog/2023-06-22-leaderboard.md b/blog/2023-06-22-leaderboard.md
index 76b3a167..cf4b7c01 100644
--- a/blog/2023-06-22-leaderboard.md
+++ b/blog/2023-06-22-leaderboard.md
@@ -7,11 +7,11 @@ previewImg: /images/blog/leaderboard_week8/ability_breakdown.png
 In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics:
 
-1. **Chatbot Arena Elo**, based on 42K anonymous votes from Chatbot Arena using the Elo rating system.
+1. **Chatbot Arena Elo**, based on 42K anonymous votes from [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) using the Elo rating system.
 2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685).
 3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300).
 
-Furthermore, we’re excited to introduce our **new series of Vicuna v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
+Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
 Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights).
 
 ## Updated Leaderboard and New Models
 
@@ -22,12 +22,12 @@ td {text-align: left}
-Table 1. LLM Leaderboard (Timeframe: April 24 - June 22, 2023). More details at our Leaderboard
-.
+Table 1. LLM Leaderboard (Timeframe: April 24 - June 22, 2023). More details at our Leaderboard.
@@ -35,43 +35,43 @@ td {text-align: left}
 <tr> <td>Model</td> <td>MT-bench (score)</td> <td>Elo Rating</td> <td>MMLU</td> <td>License</td> </tr>
 <tr> <td>GPT-4</td> <td>8.99</td> <td>1227</td> <td>86.4</td> <td>Proprietary</td> </tr>
 <tr> <td>GPT-3.5-turbo</td> <td>7.94</td> <td>1130</td> <td>70</td> <td>Proprietary</td> </tr>
 <tr> <td>Claude-instant-v1</td> <td>7.85</td> <td>1156</td> <td>61.3</td> <td>Proprietary</td> </tr>
-<tr> <td>Vicuna-33B</td> <td>7.12</td> <td>-</td> <td>59.2</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Vicuna-33B</td> <td>7.12</td> <td>-</td> <td>59.2</td> <td>Non-commercial</td> </tr>
-<tr> <td>WizardLM-30B</td> <td>7.01</td> <td>-</td> <td>58.7</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>WizardLM-30B</td> <td>7.01</td> <td>-</td> <td>58.7</td> <td>Non-commercial</td> </tr>
-<tr> <td>Guanaco-33B</td> <td>6.53</td> <td>1065</td> <td>57.6</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Guanaco-33B</td> <td>6.53</td> <td>1065</td> <td>57.6</td> <td>Non-commercial</td> </tr>
-<tr> <td>Tulu-30B</td> <td>6.43</td> <td>-</td> <td>58.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Tulu-30B</td> <td>6.43</td> <td>-</td> <td>58.1</td> <td>Non-commercial</td> </tr>
-<tr> <td>Guanaco-65B</td> <td>6.41</td> <td>-</td> <td>62.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Guanaco-65B</td> <td>6.41</td> <td>-</td> <td>62.1</td> <td>Non-commercial</td> </tr>
-<tr> <td>OpenAssistant-LLaMA-30B</td> <td>6.41</td> <td>-</td> <td>55.9</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>OpenAssistant-LLaMA-30B</td> <td>6.41</td> <td>-</td> <td>55.9</td> <td>Non-commercial</td> </tr>
 <tr> <td>PaLM2-Chat-Bison-001</td> <td>6.4</td> <td>1038</td> <td>-</td> <td>Proprietary</td> </tr>
-<tr> <td>Vicuna-13B</td> <td>6.39</td> <td>1061</td> <td>52.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Vicuna-13B</td> <td>6.39</td> <td>1061</td> <td>52.1</td> <td>Non-commercial</td> </tr>
-<tr> <td>WizardLM-13B</td> <td>6.35</td> <td>1048</td> <td>52.3</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>WizardLM-13B</td> <td>6.35</td> <td>1048</td> <td>52.3</td> <td>Non-commercial</td> </tr>
-<tr> <td>Vicuna-7B</td> <td>6</td> <td>1008</td> <td>47.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Vicuna-7B</td> <td>6</td> <td>1008</td> <td>47.1</td> <td>Non-commercial</td> </tr>
-<tr> <td>Baize-v2-13B</td> <td>5.75</td> <td>-</td> <td>48.9</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Baize-v2-13B</td> <td>5.75</td> <td>-</td> <td>48.9</td> <td>Non-commercial</td> </tr>
-<tr> <td>Nous-Hermes-13B</td> <td>5.51</td> <td>-</td> <td>49.3</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Nous-Hermes-13B</td> <td>5.51</td> <td>-</td> <td>49.3</td> <td>Non-commercial</td> </tr>
 <tr> <td>MPT-7B-Chat</td> <td>5.42</td> <td>956</td> <td>32</td> <td>CC-By-NC-SA-4.0</td> </tr>
-<tr> <td>GPT4All-13B-Snoozy</td> <td>5.41</td> <td>986</td> <td>43</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>GPT4All-13B-Snoozy</td> <td>5.41</td> <td>986</td> <td>43</td> <td>Non-commercial</td> </tr>
-<tr> <td>Koala-13B</td> <td>5.35</td> <td>992</td> <td>44.7</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Koala-13B</td> <td>5.35</td> <td>992</td> <td>44.7</td> <td>Non-commercial</td> </tr>
 <tr> <td>Falcon-40B-Instruct</td> <td>5.17</td> <td>-</td> <td>54.7</td> <td>Apache 2.0</td> </tr>
@@ -81,7 +81,7 @@ td {text-align: left}
-<tr> <td>H2O-Oasst-OpenLLaMA-13B</td> <td>4.63</td> <td>-</td> <td>42.8</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>H2O-Oasst-OpenLLaMA-13B</td> <td>4.63</td> <td>-</td> <td>42.8</td> <td>Non-commercial</td> </tr>
-<tr> <td>Alpaca-13B</td> <td>4.53</td> <td>930</td> <td>48.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>Alpaca-13B</td> <td>4.53</td> <td>930</td> <td>48.1</td> <td>Non-commercial</td> </tr>
-<tr> <td>ChatGLM-6B</td> <td>4.5</td> <td>905</td> <td>36.1</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>ChatGLM-6B</td> <td>4.5</td> <td>905</td> <td>36.1</td> <td>Non-commercial</td> </tr>
 <tr> <td>Oasst-Pythia-12B</td> <td>4.32</td> <td>924</td> <td>27</td> <td>Apache 2.0</td> </tr>
 <tr> <td>FastChat-T5-3B</td> <td>3.04</td> <td>897</td> <td>47.7</td> <td>Apache 2.0</td> </tr>
-<tr> <td>LLaMA-13B</td> <td>2.61</td> <td>826</td> <td>47</td> <td>Weights available; Non-commercial</td> </tr>
+<tr> <td>LLaMA-13B</td> <td>2.61</td> <td>826</td> <td>47</td> <td>Non-commercial</td> </tr>
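The Elo Rating column above is derived from pairwise human votes in Chatbot Arena. As a rough illustration of how such ratings arise from battle records, here is a minimal online Elo update; the K-factor, initial rating, function names, and sample battles are illustrative assumptions, not the blog's actual parameters or data:

```python
# Minimal online Elo update over pairwise battle records.
# NOTE: k, init, and the sample battles are assumptions for
# illustration only -- not Chatbot Arena's actual settings.
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(battles, k=4, init=1000):
    """battles: iterable of (model_a, model_b, winner),
    with winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: float(init))
    for a, b, winner in battles:
        ea = expected_score(ratings[a], ratings[b])
        sa = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Winner gains what the loser gives up: updates are zero-sum.
        ratings[a] += k * (sa - ea)
        ratings[b] += k * ((1.0 - sa) - (1.0 - ea))
    return dict(ratings)


if __name__ == "__main__":
    sample = [
        ("gpt-4", "vicuna-13b", "a"),
        ("gpt-4", "vicuna-13b", "a"),
        ("vicuna-13b", "gpt-4", "tie"),
    ]
    print(update_elo(sample))
```

Because each battle moves points symmetrically between the two models, the total rating mass is conserved; a model's final score reflects both how often it wins and how strong its opponents were.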