diff --git a/blog/2023-05-03-arena.md b/blog/2023-05-03-arena.md
index 8d085a27..41784391 100644
--- a/blog/2023-05-03-arena.md
+++ b/blog/2023-05-03-arena.md
@@ -13,7 +13,7 @@ td {text-align: left}
-Table 1. Elo ratings of popular open-source large language models. (Timeframe: April 24 - May 1, 2023)
+Table 1. LLM Leaderboard (Timeframe: April 24 - May 1, 2023). The latest and detailed version here.

@@ -51,7 +51,7 @@ td {text-align: left}
 ­
-Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).

Figure 1. The side-by-side chatting and voting interface.

@@ -59,6 +59,7 @@ Table 1 displays the Elo ratings of nine popular models, which are based on the
 Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates:
 - [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/)
 - [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/)
+- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/)
 ## Introduction
 Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia.
diff --git a/blog/2023-05-10-leaderboard.md b/blog/2023-05-10-leaderboard.md
index 5292ce10..39b4dd02 100644
--- a/blog/2023-05-10-leaderboard.md
+++ b/blog/2023-05-10-leaderboard.md
@@ -14,7 +14,7 @@ In this update, we have added 4 new yet strong players into the Arena, including
 - Anthropic Claude-v1
 - RWKV-4-Raven-14B
-Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).
-Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 8, 2023)
+Table 1. LLM Leaderboard (Timeframe: April 24 - May 8, 2023). The latest and detailed version here.

diff --git a/blog/2023-05-25-leaderboard.md b/blog/2023-05-25-leaderboard.md
index 8544cf47..38cc05c3 100644
--- a/blog/2023-05-25-leaderboard.md
+++ b/blog/2023-05-25-leaderboard.md
@@ -15,7 +15,7 @@ In this update, we are excited to welcome the following models joining the [Chat
 A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below. We provide a [Google Colab notebook](https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU?usp=sharing) to analyze the voting data, including the computation of the Elo ratings.
-You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+You can also try the voting [demo](https://arena.lmsys.org).
-Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 22, 2023)
+Table 1. LLM Leaderboard (Timeframe: April 24 - May 22, 2023). The latest and detailed version here.

Rank Model Elo Rating Description License
diff --git a/blog/2023-06-22-leaderboard.md b/blog/2023-06-22-leaderboard.md
index 52a01dff..4d3b4276 100644
--- a/blog/2023-06-22-leaderboard.md
+++ b/blog/2023-06-22-leaderboard.md
@@ -11,7 +11,7 @@ In this blog post, we share the latest update on Chatbot Arena leaderboard, whic
 2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685).
 3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300).
-Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
+Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations. Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights).
 ## Updated Leaderboard and New Models
@@ -132,7 +132,7 @@ th:nth-child(1) .arrow-down {
-Table 1. LLM Leaderboard (Timeframe: April 24 - June 22, 2023). More details at our Leaderboard.
+Table 1. LLM Leaderboard (Timeframe: April 24 - June 19, 2023). The latest and detailed version here.

Rank Model Elo Rating Description License
@@ -198,10 +198,9 @@ th:nth-child(1) .arrow-down {
 ­
-Welcome to check more details on our latest [leaderboard](https://chat.lmsys.org/?leaderboard) and try the [Chatbot Arena](https://chat.lmsys.org/?arena).
+Welcome to try the Chatbot Arena Voting [Demo](https://chat.lmsys.org/?arena).
 Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details.
-
 ## Evaluating Chatbots with MT-bench and Arena
 ### Motivation
diff --git a/public/images/blog/leaderboard_week8/ability_breakdown.png b/public/images/blog/leaderboard_week8/ability_breakdown.png
index 65dd234c..87c6a740 100644
Binary files a/public/images/blog/leaderboard_week8/ability_breakdown.png and b/public/images/blog/leaderboard_week8/ability_breakdown.png differ
Model MT-bench (score) Elo Rating MMLU License