
Commit

Update table captions and ability_breakdown.png
merrymercy committed Jun 22, 2023
1 parent f31b098 commit e32a39b
Showing 5 changed files with 10 additions and 10 deletions.
5 changes: 3 additions & 2 deletions blog/2023-05-03-arena.md
@@ -13,7 +13,7 @@ td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of popular open-source large language models. (Timeframe: April 24 - May 1, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 1, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr>
@@ -51,14 +51,15 @@ td {text-align: left}

&shy;

-Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).

<img src="/images/blog/arena/chat_demo.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 1. The side-by-side chatting and voting interface.</p>

Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates:
- [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/)
- [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/)
- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/)

## Introduction
Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia.
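The captions edited above reference Elo ratings computed from crowd-sourced pairwise votes in the linked Colab notebook. As a minimal sketch of that kind of computation (not the notebook's actual code; the battle-record format and the constants `k=4`, `base=1000`, and `scale=400` are illustrative assumptions), an online Elo update over vote records can look like this:

```python
from collections import defaultdict

def compute_elo(battles, k=4, base=1000, scale=400):
    """Online Elo over a sequence of pairwise battles.

    `battles` is assumed to be an iterable of (model_a, model_b, winner)
    tuples, where winner is "model_a", "model_b", or "tie".
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        ea = 1 / (1 + 10 ** ((rb - ra) / scale))
        # Actual score: 1 if model_a wins, 0 if model_b wins, 0.5 for a tie.
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (sa - ea)
        ratings[model_b] += k * (ea - sa)
    return dict(ratings)

# Hypothetical votes, for illustration only.
battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("vicuna-13b", "dolly-v2-12b", "tie"),
]
print(compute_elo(battles))
```

The K-factor controls how quickly ratings move after each vote; smaller values make the final ratings less sensitive to the order in which votes arrive.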
4 changes: 2 additions & 2 deletions blog/2023-05-10-leaderboard.md
@@ -14,15 +14,15 @@ In this update, we have added 4 new yet strong players into the Arena, including
- Anthropic Claude-v1
- RWKV-4-Raven-14B

-Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).

<style>
th {text-align: left}
td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 8, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 8, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Rank</th> <th>Model</th> <th>Elo Rating</th> <th>Description</th> <th>License</th> </tr>
4 changes: 2 additions & 2 deletions blog/2023-05-25-leaderboard.md
@@ -15,15 +15,15 @@ In this update, we are excited to welcome the following models joining the [Chat
A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below.

We provide a [Google Colab notebook](https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU?usp=sharing) to analyze the voting data, including the computation of the Elo ratings.
-You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+You can also try the voting [demo](https://arena.lmsys.org).

<style>
th {text-align: left}
td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 22, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 22, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Rank</th> <th>Model</th> <th>Elo Rating</th> <th>Description</th> <th>License</th> </tr>
7 changes: 3 additions & 4 deletions blog/2023-06-22-leaderboard.md
@@ -11,7 +11,7 @@ In this blog post, we share the latest update on Chatbot Arena leaderboard, which
2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685).
3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300).

-Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
+Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights).

## Updated Leaderboard and New Models
@@ -132,7 +132,7 @@ th:nth-child(1) .arrow-down {


<br>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - June 22, 2023). More details at <a href="https://chat.lmsys.org/?leaderboard" target="_blank">our Leaderboard</a>.</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - June 19, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table id="Table1" style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Model</th> <th onclick="sortTable(1, 'Table1')">MT-bench (score) <span class="arrow arrow-down"></span></th> <th onclick="sortTable(2, 'Table1')">Elo Rating <span class="arrow"></span></th> <th onclick="sortTable(3, 'Table1')">MMLU <span class="arrow"></span></th> <th>License</th> </tr>
@@ -198,10 +198,9 @@ th:nth-child(1) .arrow-down {

&shy;

-Welcome to check more details on our latest [leaderboard](https://chat.lmsys.org/?leaderboard) and try the [Chatbot Arena](https://chat.lmsys.org/?arena).
+Welcome to try the Chatbot Arena Voting [Demo](https://chat.lmsys.org/?arena).
Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details.


## Evaluating Chatbots with MT-bench and Arena

### Motivation
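The June 22 post edited above pairs Arena Elo with MT-Bench, which grades multi-turn answers using GPT-4 as a judge. A minimal sketch of single-answer grading in that style follows; the prompt wording, the `gpt-4` model name, and the legacy `openai` 0.x client call are assumptions for illustration, not the paper's exact templates or code:

```python
import openai  # assumes openai==0.27.x and OPENAI_API_KEY in the environment

# Illustrative judge prompt; the real MT-Bench templates differ.
JUDGE_PROMPT = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Rate the response on a scale of 1 to 10 and reply with only the number.\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}"
)

def judge_answer(question: str, answer: str) -> float:
    """Ask GPT-4 to grade one answer; returns the parsed 1-10 score."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return float(resp["choices"][0]["message"]["content"].strip())
```

Grading at temperature 0 keeps the judge's scores reproducible across runs, which matters when comparing models on the same question set.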
Binary file modified public/images/blog/leaderboard_week8/ability_breakdown.png

0 comments on commit e32a39b
