
Commit

update
infwinston committed Apr 30, 2024
1 parent 1c034cb commit 08a1ae2
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion blog/2024-03-01-policy.md
@@ -17,6 +17,8 @@ Chatbot Arena was first launched in [May 2023](https://lmsys.org/blog/2023-05-03

Our periodic [leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of [user preference data](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) and [one million user prompts](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), supporting research and model improvement.

We also collaborate with open-source and commercial model providers to bring their latest models to the community for preview testing. We believe this initiative helps advance the field and encourages user engagement, which collects the crucial votes needed to evaluate all the models in the Arena. Moreover, it provides an opportunity for the community to test and provide anonymized feedback before the models are officially released.

The platform's infrastructure ([FastChat](https://github.com/lm-sys/FastChat)) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs.

In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress.
@@ -38,7 +40,9 @@ Once a publicly released model is listed on the leaderboard, the model will rema
2. Accumulate enough votes until the model's rating stabilizes (a sketch of one possible stabilization check follows this list).
3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
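
The policy does not define what "stabilizes" means in code. Purely as an illustration, the sketch below assumes a Bradley-Terry-style fit over pairwise votes and treats a model's rating as stable once a bootstrap confidence interval around its Elo-scaled score becomes narrow; the function names, the Elo-style scaling, and the 30-point threshold are assumptions made for this example, not the Arena's actual criteria.

```python
# Illustrative sketch only: the policy above does not specify the exact
# stabilization rule. This assumes a Bradley-Terry-style fit over pairwise
# votes and treats a narrow bootstrap confidence interval as "stable".
import math
import random
from collections import defaultdict

def fit_bradley_terry(battles, n_iters=100):
    """Fit Bradley-Terry strengths with the standard MM updates.
    battles: list of (winner, loser) model-name pairs (ties omitted)."""
    models = {m for pair in battles for m in pair}
    scores = {m: 1.0 for m in models}
    wins = defaultdict(int)          # total wins per model
    pair_counts = defaultdict(int)   # number of battles per unordered pair
    for w, l in battles:
        wins[w] += 1
        pair_counts[frozenset((w, l))] += 1
    for _ in range(n_iters):
        new_scores = {}
        for m in models:
            denom = sum(
                pair_counts[frozenset((m, o))] / (scores[m] + scores[o])
                for o in models
                if o != m and frozenset((m, o)) in pair_counts
            )
            new_scores[m] = wins[m] / denom if denom else scores[m]
        total = sum(new_scores.values()) or 1.0
        scores = {m: len(models) * s / total for m, s in new_scores.items()}
    return scores

def rating_is_stable(battles, model, n_boot=100, max_ci_width=30.0):
    """Bootstrap the votes; call the rating stable once the 95% interval of
    the model's Elo-scaled score is narrower than max_ci_width points."""
    if not battles:
        return False
    ratings = []
    for _ in range(n_boot):
        sample = [random.choice(battles) for _ in range(len(battles))]
        scores = fit_bradley_terry(sample)
        if scores.get(model, 0) > 0:
            # Convert Bradley-Terry strength to an Elo-like scale (assumed).
            ratings.append(400 * math.log10(scores[model]) + 1000)
    if len(ratings) < n_boot // 2:
        return False
    ratings.sort()
    lo = ratings[int(0.025 * len(ratings))]
    hi = ratings[min(len(ratings) - 1, int(0.975 * len(ratings)))]
    return hi - lo < max_ci_width

# Example: rating_is_stable(votes, "new-model") where votes is a list such as
# [("new-model", "baseline-a"), ("baseline-b", "new-model"), ...]
```

A production fit would also handle ties and typically use a regularized logistic regression, but the gist, keep collecting votes until the interval around the new model's rating is tight, matches the listing steps above.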

**Evaluating unreleased models**: We allow model providers to test their unreleased models anonymously (i.e., the model's name will be anonymized). A model is unreleased if its weights are neither open, nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
**Evaluating unreleased models**: We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing.

Model providers can test their unreleased models anonymously, meaning the models' names will be anonymized. A model is considered unreleased if its weights are neither open nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
1. Add the model to Arena with an anonymous label, i.e., its identity will not be shown to users.
2. Keep it until we accumulate enough votes for its rating to stabilize or until the model provider withdraws it.
3. Once we accumulate enough votes, we will share the result privately with the model provider. This includes the rating, and we will also release samples of up to 20% of the votes. (See "Sharing data with the model providers" for further details.)
