Conflicting results with the leaderboard. #54

Open · cinjon opened this issue Dec 10, 2024 · 1 comment

cinjon commented Dec 10, 2024

I've posted results below from running arena-hard this morning that don't make sense. Namely, the style-control results agree with the existing leaderboard for both gpt-4o and gpt-4, but not for either gemma-2-27b-it or gemma-2-9b-it. The non-style-control results agree with the leaderboard for gpt-4o, gemma-2-27b-it, and gpt-4, but not gemma-2-9b-it.

Did I do something wrong? Can someone please verify?


As baselines, I ran gemma-2-9b-it and gemma-2-27b-it via SGLang, e.g. python -m sglang.launch_server --model-path google/gemma-2-9b-it --port 10022.

I then put these in the api_config.yaml, the gen_answer_config.yaml, and the judge_config.yaml:

api_config.yaml:

gemma-9b-it-reg:
    model_name: gemma-9b-it-reg
    endpoints:
        - api_base: http://localhost:10022/v1
          api_key: "UNUSED"
    api_type: openai
    tokenizer: "google/gemma-2-9b-it"
    parallel: 8

gemma-27b-it-reg:
    model_name: gemma-27b-it-reg
    endpoints:
        - api_base: http://localhost:10021/v1
          api_key: "UNUSED"
    api_type: openai
    tokenizer: "google/gemma-2-9b-it"
    parallel: 8

gen_answer_config.yaml:

model_list:
  - gpt-4o-2024-08-06
  - gemma-9b-it-reg
  - gemma-27b-it-reg

judge_config.yaml:

model_list:
  - gpt-4o-2024-08-06
  - gemma-9b-it-reg
  - gemma-27b-it-reg
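
As a quick sanity check that each endpoint speaks the OpenAI API, something like the snippet below works (a sketch: base_url and api_key come from the api_config above, but the served model name is an assumption; SGLang defaults to the --model-path):

# Sanity-check the local SGLang OpenAI-compatible endpoint used above.
# The model name is assumed to be SGLang's default served name
# (the --model-path it was launched with).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10022/v1", api_key="UNUSED")
resp = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Say hi."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)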

Finally, I ran python gen_answer.py and python gen_judgement.py as instructed. My results are below. As noted above, the style-control results agree with the existing leaderboard for gpt-4o and gpt-4 but not for either Gemma model, and the non-style-control results agree for everything except gemma-2-9b-it.

python show_results.py --style-control:

gpt-4o-2024-08-06              | score: 73.4  | 95% CI: (-2.5, 2.4)  | average #tokens: 594
gemma-27b-it-reg               | score: 59.4  | 95% CI: (-2.7, 2.6)  | average #tokens: 578
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423
gemma-9b-it-reg                | score: 43.1  | 95% CI: (-2.1, 2.6)  | average #tokens: 547

python show_results.py:

gpt-4o-2024-08-06              | score: 77.2  | 95% CI: (-2.2, 2.0)  | average #tokens: 594
gemma-27b-it-reg               | score: 58.6  | 95% CI: (-2.7, 2.4)  | average #tokens: 578
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423
gemma-9b-it-reg                | score: 41.5  | 95% CI: (-2.2, 2.2)  | average #tokens: 547

CodingWithTim (Collaborator) commented Dec 14, 2024

@cinjon Hi there!

  1. Style Control: The statistical technique behind Style Control varies depending on the model pool, so scores shift when the set of models changes. See Issue #50 for an explanation and the Style Control blog post for additional details; a rough sketch of the idea follows this list.
  2. The reported scores for Gemma-2-27b-it and Gemma-2-9b-it were generated via Google's API, which is the same API we use on Chatbot Arena. It is possible SGLang's Gemma-2-27b-it performs differently. However, the difference seems large, so I will investigate.
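
To make point 1 concrete, here is a minimal sketch of the idea behind style control, not the actual arena-hard implementation: a Bradley-Terry model fit as a logistic regression with an extra style covariate (normalized answer-length difference). Because all coefficients are estimated jointly, adding or removing models from the pool changes every score.

# Minimal sketch of style control, NOT the arena-hard code; all names
# here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_scores(battles, models):
    """battles: iterable of (model_a, model_b, a_won, len_a, len_b),
    where a_won is 1 if model_a won and 0 otherwise."""
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for a, b, a_won, len_a, len_b in battles:
        row = np.zeros(len(models) + 1)
        row[idx[a]], row[idx[b]] = 1.0, -1.0         # model strength terms
        row[-1] = (len_a - len_b) / (len_a + len_b)  # style covariate
        X.append(row)
        y.append(a_won)
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), y)
    # The first len(models) coefficients are the style-adjusted strengths;
    # refitting with a different model pool changes all of them at once.
    return dict(zip(models, clf.coef_[0][:len(models)]))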

Thanks!
