update links (#114)
* Update 2024-03-01-policy.md

* update link

* update link
infwinston authored Aug 26, 2024
1 parent cc5c5f9 · commit fc7536d
Showing 15 changed files with 22 additions and 22 deletions.
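Every change shown below is a link swap from the old Arena domains (arena.lmsys.org, chat.lmsys.org, leaderboard.lmsys.org) to lmarena.ai in the blog posts. For context, a bulk rewrite of this kind can be scripted; the sketch below is a minimal, hypothetical illustration (not the script actually used for this commit), and it only performs plain domain swaps, so links whose path or query string also changes in this commit (for example /?arena dropped, or leaderboard.lmsys.org becoming lmarena.ai/?leaderboard) would still need manual edits.

```python
# Hypothetical helper, not the script behind this commit: rewrite old Arena
# domains to lmarena.ai in every markdown post under blog/.
from pathlib import Path

# Simple domain swaps visible in this diff. Links whose path or query string
# also changes in the commit are only partially covered and need manual review.
REPLACEMENTS = {
    "https://arena.lmsys.org": "https://lmarena.ai",
    "https://chat.lmsys.org": "https://lmarena.ai",
    "http://chat.lmsys.org": "http://lmarena.ai",
}

def update_links(root: str = "blog") -> None:
    # Walk the blog posts, apply the swaps, and rewrite only files that changed.
    for path in sorted(Path(root).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        updated = text
        for old, new in REPLACEMENTS.items():
            updated = updated.replace(old, new)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            print(f"updated links in {path}")

if __name__ == "__main__":
    update_links()
```

Running something like this from the repository root and then reviewing the resulting git diff would surface the links that still need hand-editing.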
8 changes: 4 additions & 4 deletions blog/2023-05-03-arena.md
@@ -51,7 +51,7 @@ td {text-align: left}

&shy;

-Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).
+Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://lmarena.ai).

<img src="/images/blog/arena/chat_demo.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 1. The side-by-side chatting and voting interface.</p>
@@ -103,7 +103,7 @@ To collect data, we launched the arena with several popular open-source LLMs one
</div>

## Data Collection
-We hosted the arena at [https://arena.lmsys.org](https://arena.lmsys.org) with our multi-model serving system, [FastChat](https://github.com/lm-sys/FastChat). When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1.
+We hosted the arena at [https://lmarena.ai](https://lmarena.ai) with our multi-model serving system, [FastChat](https://github.com/lm-sys/FastChat). When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1.
After getting responses from the two models, users can continue chatting or vote for the model they think is better. Once a vote is submitted, the model names will be revealed. Users can continue chatting or restart a new battle with two new randomly chosen anonymous models. The platform logs all user interactions. In our analysis, we only use the votes when the model names are hidden.

The arena was launched about one week ago and we have collected 4.7k valid anonymous votes since then. We share some exploratory analysis in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) and present a short summary here.
@@ -152,13 +152,13 @@ We plan to work on the following items:
We appreciate any feedback from you to make the arena better.

## Join Us
-We invite the entire community to join this benchmarking effort by contributing your models and votes for the anonymous models you think provide better answers. You can visit [https://arena.lmsys.org](https://arena.lmsys.org) to vote for better models. If you want to see a specific model in the arena, you can follow this [guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) to help us add it.
+We invite the entire community to join this benchmarking effort by contributing your models and votes for the anonymous models you think provide better answers. You can visit [https://lmarena.ai](https://lmarena.ai) to vote for better models. If you want to see a specific model in the arena, you can follow this [guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) to help us add it.

## Acknowledgment
We thank other members of the Vicuna team for valuable feedback and MBZUAI for donating compute resources. Additionally, we extend our thanks to Tianjun Zhang and Eric Wallace for their insightful discussions.

## Links
-- Demo: [https://arena.lmsys.org](https://arena.lmsys.org)
+- Demo: [https://lmarena.ai](https://lmarena.ai)
- Leaderboard: [https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)
- GitHub: [https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat)
- Colab notebook: [https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing)
2 changes: 1 addition & 1 deletion blog/2023-05-10-leaderboard.md
@@ -14,7 +14,7 @@ In this update, we have added 4 new yet strong players into the Arena, including
- Anthropic Claude-v1
- RWKV-4-Raven-14B

-Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://arena.lmsys.org).
+Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://lmarena.ai).

<style>
th {text-align: left}
2 changes: 1 addition & 1 deletion blog/2023-05-25-leaderboard.md
@@ -15,7 +15,7 @@ In this update, we are excited to welcome the following models joining the [Chat
A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below.

We provide a [Google Colab notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) to analyze the voting data, including the computation of the Elo ratings.
-You can also try the voting [demo](https://arena.lmsys.org).
+You can also try the voting [demo](https://lmarena.ai).

<style>
th {text-align: left}
4 changes: 2 additions & 2 deletions blog/2023-06-22-leaderboard.md
@@ -177,7 +177,7 @@ th:nth-child(1) .arrow-down {

&shy;

-Welcome to try the Chatbot Arena voting [demo](https://chat.lmsys.org/?arena).
+Welcome to try the Chatbot Arena voting [demo](https://lmarena.ai).
Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details.

## Evaluating Chatbots with MT-bench and Arena
@@ -190,7 +190,7 @@ Traditional benchmarks often test LLMs on close-ended questions with concise out

To fill this gap, in this leaderboard update, in addition to the Chatbot Arena Elo system, we add a new benchmark: MT-Bench.
- [MT-bench](https://arxiv.org/abs/2306.05685) is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models. You can view sample questions and answers of MT-bench [here](https://huggingface.co/spaces/lmsys/mt-bench).
-- [Chatbot Arena](https://chat.lmsys.org/?arena) is a crowd-sourced battle platform, where users ask chatbots any question and vote for their preferred answer.
+- [Chatbot Arena](https://lmarena.ai) is a crowd-sourced battle platform, where users ask chatbots any question and vote for their preferred answer.

Both benchmarks are designed to use human preferences as the primary metric.

2 changes: 1 addition & 1 deletion blog/2023-07-20-dataset.md
@@ -18,7 +18,7 @@ Therefore, we think our datasets are highly valuable due to the expensive nature

We are hosting the latest leaderboard at [lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). Below is a screenshot. Since the last update, we added two 30B models: Vicuna-33B-v1.3 and MPT-30B-chat, both of which perform very well in the arena.
Two days ago, we also introduced Llama 2 and Claude 2 to the arena. The leaderboard will soon include them after we get enough votes.
-Please help us by casting your votes at our voting [website](https://chat.lmsys.org/?arena).
+Please help us by casting your votes at our voting [website](https://lmarena.ai).

Besides the slowly updated Arena Elo ratings, we also use MT-bench, a fast GPT-4 based automatic evaluation pipeline to evaluate all new models, including LLama 2 (chat), Claude 2, WizardLM-13B-v1.1, XGen-7B-8K-Inst, and ChatGLM2-6B.
You are welcome to check out the interactive [lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) to sort the models according to different metrics.
2 changes: 1 addition & 1 deletion blog/2023-10-30-toxicchat.md
@@ -22,7 +22,7 @@ For example, the following prompts do not include specific toxic words but will

Therefore, it is critical to develop toxicity benchmarks rooted in real-world user-AI dialogues, which can help develop a better conversational AI system for addressing toxic behavior embedded within this specific conversation context.

-In this work, we conduct a benchmark study focused on toxicity in real-world user-AI interactions. We create a comprehensive toxicity benchmark ToxicChat based on real chat data from the Vicuna and Chatbot Arena [demo](https://chat.lmsys.org/), which can be utilized to understand user behaviors and improve the performance of moderation for AI chatbots. The dataset can be downloaded at <https://huggingface.co/datasets/lmsys/toxic-chat>.
+In this work, we conduct a benchmark study focused on toxicity in real-world user-AI interactions. We create a comprehensive toxicity benchmark ToxicChat based on real chat data from the Vicuna and Chatbot Arena [demo](https://lmarena.ai/), which can be utilized to understand user behaviors and improve the performance of moderation for AI chatbots. The dataset can be downloaded at <https://huggingface.co/datasets/lmsys/toxic-chat>.

## Data Collection

4 changes: 2 additions & 2 deletions blog/2023-12-07-leaderboard.md
@@ -42,7 +42,7 @@ In November, we added record-breaking nine new models with sizes ranging from 7B
On the other hand, 7B models have also shown significant improvements. Fine-tuning the 7B Mistral model has led to Zephyr, OpenChat-3.5, Starling-lm-7b-alpha, and OpenHermes-2.5-Mistral-7b which all demonstrate impressive performance despite smaller scale. Shoutout to the open-source community pushing limits! On the other hand, to understand how freshness and grounded information help LLMs in answering user queries, we also bring Perplexity AI’s online LLMs to Arena. We have collected over 1500 votes for PPLX-70B-Online and the preliminary results show great potential.
Congrats to all the teams and we look forward to seeing more models in the future!

-Please find the latest leaderboard [here](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) or try [Arena demo](https://chat.lmsys.org) to chat with 20+ models!
+Please find the latest leaderboard [here](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) or try [Arena demo](https://lmarena.ai) to chat with 20+ models!
We also prepare a [notebook](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH) to reproduce all the calculation of Elo ratings and confidence intervals.

<img src="/images/blog/leaderboard_202312/mle_elo.png" style="display:block; margin:auto; max-width:80%; height:auto;"></img>
@@ -145,7 +145,7 @@ We plan to ship real-time leaderboard update, diving deeper into user prompt ana


## Links
-- [Chatbot Arena Demo](https://chat.lmsys.org/)
+- [Chatbot Arena Demo](https://lmarena.ai/)
- [Arena Elo Colab](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=mukqgshMarFi)
- [How Is ChatGPT's Behavior Changing over Time?](https://arxiv.org/abs/2307.09009)
- Bradley-Terry model [lecture note](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture24.pdf), [paper](https://www.jstor.org/stable/2334029)
4 changes: 2 additions & 2 deletions blog/2024-03-01-policy.md
@@ -7,7 +7,7 @@ previewImg: /images/blog/arena_policy/arena_logo_v0_4x3.png

## Our Mission

-Chatbot Arena ([chat.lmsys.org](https://chat.lmsys.org)) is an open-source project developed by members from [LMSYS](https://chat.lmsys.org/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://leaderboard.lmsys.org) periodically.
+Chatbot Arena ([lmarena.ai](https://lmarena.ai)) is an open-source project developed by members from [LMSYS](https://lmarena.ai/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://lmarena.ai/?leaderboard) periodically.

<img src="/images/blog/arena_policy/arena_logo_v0_4x3.png" style="width: 50%; max-width: 50%; margin-left: auto; margin-right: auto; margin-bottom: auto"></img>

@@ -33,7 +33,7 @@ In our ongoing efforts, we feel obligated to establish policies that guarantee e

**Listing models on the leaderboard**: The public leaderboard will only include models that are accessible to other third parties. Specifically, it will only include models that are either (1) open weights or/and (2) publicly available through APIs (e.g., gpt-4-0613, gemini-pro-api), or (3) available as a service (e.g., Bard, GPT-4+browsing). In the remainder of this document we refer to these models as **publicly released models**.

-Once a publicly released model is listed on the leaderboard, the model will remain accessible at [chat.lmsys.org](https://chat.lmsys.org) for at least **two weeks** for the community to evaluate it.
+Once a publicly released model is listed on the leaderboard, the model will remain accessible at [lmarena.ai](https://lmarena.ai) for at least **two weeks** for the community to evaluate it.

**Evaluating publicly released models**. Evaluating such a model consists of the following steps:
1. Add the model to Arena for blind testing and let the community know it was added.
2 changes: 1 addition & 1 deletion blog/2024-04-19-arena-hard.md
@@ -129,7 +129,7 @@ An agreement score of 1 implies benchmark A confidently agrees on the preference

We define **separability** by whether a benchmark can separate given model pairs with derived confidence intervals (via bootstrapping). This metric can also serve to measure the variances in ranking outputs provided by a benchmark. We quantify this metric by the percentage of model pairs which have non-overlapping confidence intervals of the benchmark scores.

-We use a set of top-20 models* on [Chatbot Arena](https://chat.lmsys.org/?leaderboard) (April 13, 2024) that are presented on [AlpacaEval leaderboard](https://tatsu-lab.github.io/alpaca_eval/) to calculate separability and agreement per benchmark. We consider the human preference ranking by Chatbot Arena (English only) as the reference to calculate agreement.
+We use a set of top-20 models* on [Chatbot Arena](https://lmarena.ai/?leaderboard) (April 13, 2024) that are presented on [AlpacaEval leaderboard](https://tatsu-lab.github.io/alpaca_eval/) to calculate separability and agreement per benchmark. We consider the human preference ranking by Chatbot Arena (English only) as the reference to calculate agreement.

In Table 1, Arena-hard-Auto-v0.1 shows the highest separability (87.4%) against widely adopted LLM benchmarks and offers highest agreement (89.1%) to Chatbot Arena. It is also cheap and fast to run ($25).

2 changes: 1 addition & 1 deletion blog/2024-05-02-kaggle-competition.md
@@ -7,7 +7,7 @@ previewImg: /images/blog/kaggle_competition/thumb_4x.png

### Overview

-LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://chat.lmsys.org), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences, with personally identifiable information removed. Your solution submission will be tested on a hidden test set of 25,000 samples.
+LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://lmarena.ai), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences, with personally identifiable information removed. Your solution submission will be tested on a hidden test set of 25,000 samples.
The dataset includes real-world conversations with over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. [Click here to join the competition](https://www.kaggle.com/competitions/lmsys-chatbot-arena/overview) and download the dataset!

<img src="/images/blog/kaggle_competition/header_4x.png" style="width: 60%; max-width: 60%; margin-left: auto; margin-right: auto; margin-top: 0px; margin-bottom: 0px"></img>
2 changes: 1 addition & 1 deletion blog/2024-06-27-multimodal.md
@@ -8,7 +8,7 @@ previewImg: /images/blog/vision_arena/llama_gallery.png

### Multimodal Chatbot Arena

-We added image support to [Chatbot Arena](https://chat.lmsys.org/)! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against eachother.
+We added image support to [Chatbot Arena](https://lmarena.ai/)! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against eachother.

In just two weeks, we have collected **over 17,000 user preference votes across over 60 languages**. In this post we show the initial leaderboard and statistics, some interesting conversations submitted to the arena, and include a short discussion on the future of the multimodal arena.

2 changes: 1 addition & 1 deletion blog/2024-07-01-routellm.md
@@ -26,7 +26,7 @@ In our routing setup, we focus on the case where there are two models: a stronge

This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of the calls made to GPT-4 (which is representative of the cost incurred since the cost of a Mixtral call is negligible).

-We use *preference data* for training our routers, building upon previous works ([1](https://arxiv.org/abs/2404.14618),[2](https://huyenchip.com/2024/02/28/predictive-human-preference.html)). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
+We use *preference data* for training our routers, building upon previous works ([1](https://arxiv.org/abs/2404.14618),[2](https://huyenchip.com/2024/02/28/predictive-human-preference.html)). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://lmarena.ai). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.

We trained four routers using a mix of Chatbot Arena data and data augmentation:
- A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
(The diffs for the remaining three changed files are not shown here.)
