diff --git a/docs/leaderboard.md b/docs/leaderboard.md
index b133034..1f30341 100644
--- a/docs/leaderboard.md
+++ b/docs/leaderboard.md
@@ -4,25 +4,25 @@
 # SciCode Leaderboard
 
-| Models                    | Main Problem Resolve Rate | Subproblem |
+| Models                    | Main Problem Resolve Rate | Subproblem |
 |--------------------------|-------------------------------------|-------------------------------------|
-| 🥇 OpenAI o1-preview      | 7.7     | 28.5 |
-| 🥈 Claude3.5-Sonnet       | 4.6     | 26.0 |
-| 🥉 Claude3.5-Sonnet (new) | 4.6     | 25.3 |
-| Deepseek-Coder-v2         | 3.1     | 21.2 |
-| GPT-4o                    | 1.5     | 25.0 |
-| GPT-4-Turbo               | 1.5     | 22.9 |
-| OpenAI o1-mini            | 1.5     | 22.2 |
-| Gemini 1.5 Pro            | 1.5     | 21.9 |
-| Claude3-Opus              | 1.5     | 21.5 |
-| Llama-3.1-405B-Chat       | 1.5     | 19.8 |
-| Claude3-Sonnet            | 1.5     | 17.0 |
-| Qwen2-72B-Instruct        | 1.5     | 17.0 |
-| Llama-3.1-70B-Chat        | 0.0     | 17.0 |
-| Mixtral-8x22B-Instruct    | 0.0     | 16.3 |
-| Llama-3-70B-Chat          | 0.0     | 14.6 |
+| 🥇 OpenAI o1-preview      | **7.7** | 28.5 |
+| 🥈 Claude3.5-Sonnet       | **4.6** | 26.0 |
+| 🥉 Claude3.5-Sonnet (new) | **4.6** | 25.3 |
+| Deepseek-Coder-v2         | **3.1** | 21.2 |
+| GPT-4o                    | **1.5** | 25.0 |
+| GPT-4-Turbo               | **1.5** | 22.9 |
+| OpenAI o1-mini            | **1.5** | 22.2 |
+| Gemini 1.5 Pro            | **1.5** | 21.9 |
+| Claude3-Opus              | **1.5** | 21.5 |
+| Llama-3.1-405B-Chat       | **1.5** | 19.8 |
+| Claude3-Sonnet            | **1.5** | 17.0 |
+| Qwen2-72B-Instruct        | **1.5** | 17.0 |
+| Llama-3.1-70B-Chat        | **0.0** | 17.0 |
+| Mixtral-8x22B-Instruct    | **0.0** | 16.3 |
+| Llama-3-70B-Chat          | **0.0** | 14.6 |
 
-Note: If the models tie in the Main Problem resolve rate, we will then compare the Subproblems.
+**Note: If models tie on the Main Problem resolve rate, the tie is broken by their Subproblem resolve rate.**