
Commit

Update table captions and ability_breakdown.png
merrymercy committed Jun 22, 2023
1 parent f31b098 commit e32a39b
Showing 5 changed files with 10 additions and 10 deletions.
5 changes: 3 additions & 2 deletions blog/2023-05-03-arena.md
@@ -13,7 +13,7 @@ td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of popular open-source large language models. (Timeframe: April 24 - May 1, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 1, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr>
@@ -51,14 +51,15 @@ td {text-align: left}

&shy;

-Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).

<img src="/images/blog/arena/chat_demo.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 1. The side-by-side chatting and voting interface.</p>

Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates:
- [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/)
- [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/)
- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/)

## Introduction
Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia.
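The captions edited above reference Elo ratings computed from crowd-sourced pairwise votes in the linked Colab notebook. As a minimal sketch of that kind of computation (not the notebook's actual code; the battle-record format and the constants `k=4`, `base=1000`, and `scale=400` are illustrative assumptions), an online Elo update over vote records can look like this:

```python
from collections import defaultdict

def compute_elo(battles, k=4, base=1000, scale=400):
    """Online Elo over a sequence of pairwise battles.

    `battles` is assumed to be an iterable of (model_a, model_b, winner)
    tuples, where winner is "model_a", "model_b", or "tie".
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        ea = 1 / (1 + 10 ** ((rb - ra) / scale))
        # Actual score: 1 if model_a wins, 0 if model_b wins, 0.5 for a tie.
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (sa - ea)
        ratings[model_b] += k * (ea - sa)
    return dict(ratings)

# Hypothetical votes, for illustration only.
battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("vicuna-13b", "dolly-v2-12b", "tie"),
]
print(compute_elo(battles))
```

The K-factor controls how quickly ratings move after each vote; smaller values make the final ratings less sensitive to the order in which votes arrive.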
4 changes: 2 additions & 2 deletions blog/2023-05-10-leaderboard.md
@@ -14,15 +14,15 @@ In this update, we have added 4 new yet strong players into the Arena, including
- Anthropic Claude-v1
- RWKV-4-Raven-14B

-Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1iI_IszGAwSMkdfUrIDI6NfTG7tGDDRxZ?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).

<style>
th {text-align: left}
td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 8, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 8, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Rank</th> <th>Model</th> <th>Elo Rating</th> <th>Description</th> <th>License</th> </tr>
4 changes: 2 additions & 2 deletions blog/2023-05-25-leaderboard.md
@@ -15,15 +15,15 @@ In this update, we are excited to welcome the following models joining the [Chat
A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below.

We provide a [Google Colab notebook](https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU?usp=sharing) to analyze the voting data, including the computation of the Elo ratings.
-You can also try the voting [demo](https://arena.lmsys.org) and see the latest [leaderboard](https://leaderboard.lmsys.org).
+You can also try the voting [demo](https://arena.lmsys.org).

<style>
th {text-align: left}
td {text-align: left}
</style>

<br>
<p style="color:gray; text-align: center;">Table 1. Elo ratings of LLMs (Timeframe: April 24 - May 22, 2023)</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - May 22, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Rank</th> <th>Model</th> <th>Elo Rating</th> <th>Description</th> <th>License</th> </tr>
7 changes: 3 additions & 4 deletions blog/2023-06-22-leaderboard.md
@@ -11,7 +11,7 @@ In this blog post, we share the latest update on Chatbot Arena leaderboard, which
2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685).
3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300).

-Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
+Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations.
Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights).

## Updated Leaderboard and New Models
@@ -132,7 +132,7 @@ th:nth-child(1) .arrow-down {


<br>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - June 22, 2023). More details at <a href="https://chat.lmsys.org/?leaderboard" target="_blank">our Leaderboard</a>.</p>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - June 19, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
<table id="Table1" style="display: flex; justify-content: center;" align="left" >
<tbody>
<tr> <th>Model</th> <th onclick="sortTable(1, 'Table1')">MT-bench (score) <span class="arrow arrow-down"></span></th> <th onclick="sortTable(2, 'Table1')">Elo Rating <span class="arrow"></span></th> <th onclick="sortTable(3, 'Table1')">MMLU <span class="arrow"></span></th> <th>License</th> </tr>
@@ -198,10 +198,9 @@ th:nth-child(1) .arrow-down {

&shy;

-Welcome to check more details on our latest [leaderboard](https://chat.lmsys.org/?leaderboard) and try the [Chatbot Arena](https://chat.lmsys.org/?arena).
+Welcome to try the Chatbot Arena Voting [Demo](https://chat.lmsys.org/?arena).
Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details.


## Evaluating Chatbots with MT-bench and Arena

### Motivation
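The June 22 post edited above pairs Arena Elo with MT-Bench, which grades multi-turn answers using GPT-4 as a judge. A minimal sketch of single-answer grading in that style follows; the prompt wording, the `gpt-4` model name, and the legacy `openai` 0.x client call are assumptions for illustration, not the paper's exact templates or code:

```python
import openai  # assumes openai==0.27.x and OPENAI_API_KEY in the environment

# Illustrative judge prompt; the real MT-Bench templates differ.
JUDGE_PROMPT = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Rate the response on a scale of 1 to 10 and reply with only the number.\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}"
)

def judge_answer(question: str, answer: str) -> float:
    """Ask GPT-4 to grade one answer; returns the parsed 1-10 score."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return float(resp["choices"][0]["message"]["content"].strip())
```

Grading at temperature 0 keeps the judge's scores reproducible across runs, which matters when comparing models on the same question set.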
Binary file modified public/images/blog/leaderboard_week8/ability_breakdown.png

0 comments on commit e32a39b
