Conflicting results with the leaderboard. #54

Open · cinjon opened this issue Dec 10, 2024 · 1 comment

cinjon commented Dec 10, 2024

I've posted results below from running arena-hard this morning that don't make sense. Namely, the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma-2-27b-it or gemma-2-9b-it. The non-style-control results agree with the leaderboard for gpt-4o, gemma-2-27b-it, and gpt-4, but not gemma-2-9b-it.

Did I do something wrong? Can someone please verify?


As baselines, I ran gemma-2-9b-it and gemma-2-27b-it via SGLang, e.g. python -m sglang.launch_server --model-path google/gemma-2-9b-it --port 10022.

I then put these in the api_config.yaml, the gen_answer_config.yaml, and the judge_config.yaml:

api_config.yaml:

gemma-9b-it-reg:
    model_name: gemma-9b-it-reg
    endpoints:
        - api_base: http://localhost:10022/v1
          api_key: "UNUSED"
    api_type: openai
    tokenizer: "google/gemma-2-9b-it"
    parallel: 8

gemma-27b-it-reg:
    model_name: gemma-27b-it-reg
    endpoints:
        - api_base: http://localhost:10021/v1
          api_key: "UNUSED"
    api_type: openai
    tokenizer: "google/gemma-2-9b-it"
    parallel: 8

gen_answer_config.yaml:

model_list:
  - gpt-4o-2024-08-06
  - gemma-9b-it-reg
  - gemma-27b-it-reg

judge_config.yaml:

model_list:
  - gpt-4o-2024-08-06
  - gemma-9b-it-reg
  - gemma-27b-it-reg
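
As a quick sanity check that each endpoint speaks the OpenAI API, something like the snippet below works (a sketch: base_url and api_key come from the api_config above, but the served model name is an assumption; SGLang defaults to the --model-path):

# Sanity-check the local SGLang OpenAI-compatible endpoint used above.
# The model name is assumed to be SGLang's default served name
# (the --model-path it was launched with).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10022/v1", api_key="UNUSED")
resp = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Say hi."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)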

Finally, I ran python gen_answer.py and python gen_judgement.py as instructed. My results are below. As noted above, the style-control results agree with the existing leaderboard for gpt-4o and gpt-4 but not for either Gemma model, and the non-style-control results agree for everything except gemma-2-9b-it.

python show_results.py --style-control:

gpt-4o-2024-08-06              | score: 73.4  | 95% CI: (-2.5, 2.4)  | average #tokens: 594
gemma-27b-it-reg               | score: 59.4  | 95% CI: (-2.7, 2.6)  | average #tokens: 578
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423
gemma-9b-it-reg                | score: 43.1  | 95% CI: (-2.1, 2.6)  | average #tokens: 547

python show_results.py:

gpt-4o-2024-08-06              | score: 77.2  | 95% CI: (-2.2, 2.0)  | average #tokens: 594
gemma-27b-it-reg               | score: 58.6  | 95% CI: (-2.7, 2.4)  | average #tokens: 578
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423
gemma-9b-it-reg                | score: 41.5  | 95% CI: (-2.2, 2.2)  | average #tokens: 547

CodingWithTim (Collaborator) commented Dec 14, 2024

@cinjon Hi there!

  1. Style Control: The statistical technique behind Style Control varies depending on the model pool, so scores shift when the set of models changes. See Issue #50 for an explanation and the Style Control blog post for additional details; a rough sketch of the idea follows this list.
  2. The reported scores for Gemma-2-27b-it and Gemma-2-9b-it were generated via Google's API, which is the same API we use on Chatbot Arena. It is possible SGLang's Gemma-2-27b-it performs differently. However, the difference seems large, so I will investigate.
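
To make point 1 concrete, here is a minimal sketch of the idea behind style control, not the actual arena-hard implementation: a Bradley-Terry model fit as a logistic regression with an extra style covariate (normalized answer-length difference). Because all coefficients are estimated jointly, adding or removing models from the pool changes every score.

# Minimal sketch of style control, NOT the arena-hard code; all names
# here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_scores(battles, models):
    """battles: iterable of (model_a, model_b, a_won, len_a, len_b),
    where a_won is 1 if model_a won and 0 otherwise."""
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for a, b, a_won, len_a, len_b in battles:
        row = np.zeros(len(models) + 1)
        row[idx[a]], row[idx[b]] = 1.0, -1.0         # model strength terms
        row[-1] = (len_a - len_b) / (len_a + len_b)  # style covariate
        X.append(row)
        y.append(a_won)
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), y)
    # The first len(models) coefficients are the style-adjusted strengths;
    # refitting with a different model pool changes all of them at once.
    return dict(zip(models, clf.coef_[0][:len(models)]))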

Thanks!
