You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've posted results below from running arena-hard this morning that don't make sense. Namely, the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma-2-27b-it or gemma-2-9b-it. The non style-control results agree with the leaderboard for gpt-4o, gemma-27b-it, gpt-4, but not gemma-9b-it.
Did i do something wrong? Can someone verify please?
As baselines, I ran gemma-2-9b-it and gemma-2-27b-it from SGLang, e.g. python -m sglang.launch_server --model-path google/gemma-2-9b-it --port 10022.
I then put these in the api_config.yaml, the gen_answer_config.yaml, and the judge_config.yaml:
Finally, I ran these as instructed with python gen_answer.py and python gen_judgement.py. Afterwards, here are my results. You can see that the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma. The non style-control results agree with the leaderboard for gpt-4o, gemma-27b, gpt-4, but not gemma-9b-it.
Style Control: The statistical technique behind Style Control varies depending on the model pool. See Issue 50 for an explanation. Style Control blogpost for additional details.
The score for Gemma-2-27b-it and Gemma-2-9b-it reported is generated via Google's api which is the same api we use on Chatbot Arena. It is possible SGLang's Gemma-2-27b-it performs differently. However, it seems like the difference is big, which I will investigate.
I've posted results below from running arena-hard this morning that don't make sense. Namely, the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma-2-27b-it or gemma-2-9b-it. The non style-control results agree with the leaderboard for gpt-4o, gemma-27b-it, gpt-4, but not gemma-9b-it.
Did i do something wrong? Can someone verify please?
As baselines, I ran gemma-2-9b-it and gemma-2-27b-it from SGLang, e.g.
python -m sglang.launch_server --model-path google/gemma-2-9b-it --port 10022
.I then put these in the api_config.yaml, the gen_answer_config.yaml, and the judge_config.yaml:
api_config
gen_answer_config
judge_config
Finally, I ran these as instructed with
python gen_answer.py
andpython gen_judgement.py
. Afterwards, here are my results. You can see that the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma. The non style-control results agree with the leaderboard for gpt-4o, gemma-27b, gpt-4, but not gemma-9b-it.python show_results.py --style-control
:python show_results.py
:The text was updated successfully, but these errors were encountered: