diff --git a/blog/2024-03-01-policy.md b/blog/2024-03-01-policy.md
index f5dc2690..2c58987c 100644
--- a/blog/2024-03-01-policy.md
+++ b/blog/2024-03-01-policy.md
@@ -7,14 +7,13 @@ previewImg: /images/blog/arena_policy/arena_logo_v0_4x3.png
 ## Our Mission
-Chatbot Arena ([chat.lmsys.org](https://chat.lmsys.org)) is an open-source project developed by members from [LMSYS](https://chat.lmsys.org/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We launch the evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) periodically.
-
+Chatbot Arena ([chat.lmsys.org](https://chat.lmsys.org)) is an open-source project developed by members from [LMSYS](https://chat.lmsys.org/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://leaderboard.lmsys.org) periodically.
 ## Our Progress
-Chatbot Arena was first launched in [May 2023](https://lmsys.org/blog/2023-05-03-arena/) and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 300,000 votes across 10 million prompts. This extensive engagement has enabled the evaluation of more than 60 LLMs, such as GPT-4, Gemini/Bard, Llama, and Mistral, significantly enhancing understanding of their capabilities and limitations.
+Chatbot Arena was first launched in [May 2023](https://lmsys.org/blog/2023-05-03-arena/) and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including both commercial GPT-4, Gemini/Bard and open-weight Llama and Mistral models, significantly enhancing our understanding of their capabilities and limitations.
 Our periodic [leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs.
 Our commitment to open science is further demonstrated through the sharing of [user preference data](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) and [one million user prompts](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), supporting research and model improvement.
@@ -24,24 +23,34 @@ In our ongoing efforts, we feel obligated to establish policies that guarantee e
 ## Our Policy
-Last Updated: April 11, 2024
+Last Updated: April 29, 2024
-**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)) including UI frontend, model serving backend and evaluation tools are all open source at GitHub. This means that anyone can clone, audit or run another instance of Chatbot Arena to produce a similar leaderboard.
+**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)) including UI frontend, model serving backend, model evaluation and ranking pipelines are all open source and available on GitHub. This means that anyone can clone, audit or run another instance of Chatbot Arena to produce a similar leaderboard.
 **Transparent**: The evaluation process, including rating computation, identifying anomalous users, and LLM selection are all made publicly available so others can reproduce our analysis and fully understand the process of collecting data. Furthermore, we will involve the community in deciding any changes in the evaluation process.
-**Listing models on the leaderboard**: The leaderboard will only include models that are accessible to other third parties. In particular, the leaderboard will only include models that are either (1) open weights or/and (2) publicly available through APIs (e.g., gpt-4-0613, gemini-pro) or services (e.g., Bard, GPT-4+browsing).
+**Listing models on the leaderboard**: The public leaderboard will only include models that are accessible to other third parties. Specifically, it will only include models that are either (1) open weights or/and (2) publicly available through APIs (e.g., gpt-4-0613, gemini-pro-api), or (3) available as a service (e.g., Bard, GPT-4+browsing). In the remainder of this document we refer to these models as **publicly released models**.
-Once the model is on the leaderboard, the model will remain accessible at [chat.lmsys.org](https://chat.lmsys.org) for at least **two weeks** for the community to evaluate it.
+Once a publicly released model is listed on the leaderboard, the model will remain accessible at [chat.lmsys.org](https://chat.lmsys.org) for at least **two weeks** for the community to evaluate it.
-Before a model is published on the leaderboard, we need to accumulate enough votes to compute its rating. We host the model in the blind test mode. This is called the initial-rating phase. If the model provider decides to pull out and not show the model on the leaderboard, we will allow it, but we might still share with the community the data generated by the model during the initial-rating phase under the "anonymous" label. Note that this only applies to proprietary models under private APIs or pre-release open models. It does not apply to models offered via public APIs or open weight models which can be evaluated by anyone.
+**Evaluating publicly released models**. Evaluating such a model consists of the following steps:
+1. Add the model to Arena for blind testing and let the community know it was added.
+2. Accumulate enough votes until the model's rating stabilizes.
+3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for an one-day heads up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
-To ensure the leaderboard correctly reflects model rankings over time, we rely on live comparisons between models. We may retire models from the leaderboard that are no longer online after a certain time period.
+**Evaluating unreleased models**: We allow model providers to test their unreleased models anonymously (i.e., the model's name will be anonymized). A model is unreleased if its weights are neither open, nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
+1. Add the model to Arena with an anonymous label. i.e., its identity will not be shown to users.
+2. Keep it until we accumulate enough votes for its rating to stabilize or until the model provider withdraws it.
+3. Once we accumulate enough votes, we will share the result privately with the model provider. These include the rating, as well as release samples of up to 20% of the votes. (See Sharing data with the model providers for further details).
+4. Remove the model from Arena.
+If while we test an unreleased model, that model is publicly released, we immediately switch to the publicly released model evaluation process.
+
+To ensure the leaderboard correctly reflects model rankings over time, we rely on live comparisons between models. We may retire models from the leaderboard that are no longer online after a certain time period.
 **Sharing data with the community**: We will periodically share data with the community. In particular, we will periodically share 20% of the arena vote data we have collected including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. For the models we collected votes for but have never been on the leaderboard, we will still release data but we will label the model as "anonymous".
-**Sharing data with the model providers**: Upon request, we will offer early data access with model providers who wish to improve their models. However, this data will be a subset of data that we periodically share with the community. In particular, with a model provider, we will share the data that includes their model's answers. For battles, we will not identify the opponent. We will label the opponent as "anonymous". This data will be later shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model’s answers will be labeled as "anonymous".
+**Sharing data with the model providers**: Upon request, we will offer early data access with model providers who wish to improve their models. However, this data will be a subset of data that we periodically share with the community. In particular, with a model provider, we will share the data that includes their model's answers. For battles, we may not reveal the opponent model and may use "anonymous" label. This data will be later shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model’s answers will also be labeled as "anonymous".
 ## FAQ
@@ -55,7 +64,7 @@ We will continuously add new models and retire old ones. It is not feasible to a
 We seek to provide transparency and all tools as well as the platform we are using in open-source. We invite the community to use our platform and tools to statistically reproduce our results.
 ### Why do you only share 20% of data, not all?
-Arena's mission is to ensure trustable evaluation. We periodically share data to mitigate the potential risk of overfitting certain user distributions or preference biases in Arena. We will actively review this policy based on the community's feedback.
+We periodically share data to mitigate the potential risk of benchmark leakage. We will actively review this policy based on the community's feedback.
 ### Who will fund this effort? Any conflict of interests?
 Chatbot Arena is only funded by gifts, in money, cloud credits, or API credits. The gifts have no strings attached.