zebra-grid.summary.md

| Model | Mode | N_Mode | N_Size | Puzzle Acc | Small Puzzle Acc | Medium Puzzle Acc | Large Puzzle Acc | XL Puzzle Acc | Cell Acc | No answer | Total Puzzles | Reason Lens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| o1-2024-12-17 | greedy | single | 1 | 81 | 97.19 | 92.14 | 78 | 42.5 | 78.74 | 0.2 | 1000 | 1197.51 |
| deepseek-R1 | greedy | single | 1 | 78.7 | 98.44 | 95.71 | 73.5 | 28.5 | 80.54 | 0 | 1000 | 586.33 |
| o1-preview-2024-09-12 | greedy | single | 1 | 71.4 | 98.12 | 88.21 | 59.5 | 17 | 75.14 | 0.3 | 1000 | 1565.88 |
| o1-preview-2024-09-12-v2 | greedy | single | 1 | 70.4 | 97.81 | 88.57 | 55.5 | 16 | 74.18 | 0.4 | 1000 | 1559.71 |
| o1-mini-2024-09-12-v3 | greedy | single | 1 | 59.7 | 87.5 | 76.79 | 39 | 12 | 70.32 | 1 | 1000 | 1166.38 |
| o1-mini-2024-09-12-v2 | greedy | single | 1 | 56.8 | 83.44 | 76.43 | 36 | 7.5 | 69.87 | 1.3 | 1000 | 1164.95 |
| o1-mini-2024-09-12 | greedy | single | 1 | 52.6 | 87.81 | 67.5 | 24.5 | 3.5 | 52.29 | 0.8 | 1000 | 993.28 |
| deepseek-v3 | greedy | single | 1 | 42.1 | 85.62 | 44.64 | 10 | 1 | 42.04 | 27.9 | 1000 | 2158 |
| claude-3-5-sonnet-20241022 | greedy | single | 1 | 36.2 | 84.69 | 28.93 | 4 | 1 | 54.27 | 0 | 1000 | 861.18 |
| claude-3-5-sonnet-20240620 | greedy | single | 1 | 33.4 | 83.44 | 21.79 | 3 | 0 | 54.34 | 0 | 1000 | 1141.94 |
| Llama-3.1-405B-Inst-fp8@together | greedy | single | 1 | 32.6 | 81.25 | 22.5 | 1.5 | 0 | 45.8 | 12.5 | 1000 | 314.66 |
| gpt-4o-2024-08-06 | greedy | single | 1 | 31.7 | 80 | 19.64 | 2.5 | 0.5 | 50.34 | 3.6 | 1000 | 1106.51 |
| gemini-1.5-pro-exp-0827 | greedy | single | 1 | 30.5 | 75.31 | 20.71 | 3 | 0 | 50.84 | 0.8 | 1000 | 1594.47 |
| Llama-3.1-405B-Inst@sambanova | greedy | single | 1 | 30.1 | 79.06 | 16.43 | 0.5 | 0.5 | 39.06 | 24.7 | 1000 | 2001.12 |
| chatgpt-4o-latest-24-09-07 | greedy | single | 1 | 29.9 | 76.88 | 17.86 | 1.5 | 0 | 48.83 | 4.2 | 1000 | 1539.99 |
| Mistral-Large-2 | greedy | single | 1 | 29 | 75.94 | 15 | 2.5 | 0 | 47.64 | 1.7 | 1000 | 1592.39 |
| gpt-4-turbo-2024-04-09 | greedy | single | 1 | 28.4 | 75.31 | 15 | 0.5 | 0 | 47.9 | 0.1 | 1000 | 1148.46 |
| gpt-4o-2024-05-13 | greedy | single | 1 | 28.2 | 73.75 | 16.43 | 0 | 0 | 38.72 | 19.3 | 1000 | 1643.51 |
| grok-2-1212 | greedy | single | 1 | 27.7 | 71.88 | 13.93 | 4 | 0 | 48.16 | 3.5 | 1000 | 2551.39 |
| gpt-4-0314 | greedy | single | 1 | 27.1 | 71.25 | 13.57 | 2.5 | 0 | 47.43 | 0.2 | 1000 | 1203.17 |
| claude-3-opus-20240229 | greedy | single | 1 | 27 | 73.44 | 12.14 | 0.5 | 0 | 48.91 | 0 | 1000 | 855.72 |
| Qwen2.5-72B-Instruct | greedy | single | 1 | 26.6 | 72.5 | 12.14 | 0 | 0 | 40.92 | 11.9 | 1000 | 1795.9 |
| Qwen2.5-32B-Instruct | greedy | single | 1 | 26.1 | 72.19 | 10.36 | 0.5 | 0 | 43.39 | 6.3 | 1000 | 1333.07 |
| gemini-1.5-pro-exp-0801 | greedy | single | 1 | 25.2 | 66.56 | 13.93 | 0 | 0 | 48.5 | 0 | 1000 | 1389.75 |
| Llama-3.1-405B-Inst@hyperbolic | greedy | single | 1 | 25 | 50 | 33.33 | 0 | 0 | 46.62 | 6.25 | 16 | 1517.13 |
| gemini-1.5-flash-exp-0827 | greedy | single | 1 | 25 | 65 | 13.57 | 2 | 0 | 43.56 | 8.5 | 1000 | 1705.11 |
| Meta-Llama-3.1-70B-Instruct | greedy | single | 1 | 24.9 | 67.81 | 10.36 | 1.5 | 0 | 27.98 | 43 | 1000 | 1483.68 |
| deepseek-v2-chat-0628 | greedy | single | 1 | 22.7 | 63.44 | 8.57 | 0 | 0 | 42.46 | 5.2 | 1000 | 1260.23 |
| deepseek-v2.5-0908 | greedy | single | 1 | 22.1 | 62.19 | 7.86 | 0 | 0 | 38.01 | 12.7 | 1000 | 1294.46 |
| Qwen2-72B-Instruct | greedy | single | 1 | 21.4 | 60.94 | 6.79 | 0 | 0 | 38.32 | 10.2 | 1000 | 1813.82 |
| deepseek-v2-coder-0614 | greedy | single | 1 | 21.1 | 59.69 | 7.14 | 0 | 0 | 41.58 | 4.9 | 1000 | 1324.55 |
| deepseek-v2-coder-0724 | greedy | single | 1 | 20.5 | 57.5 | 7.14 | 0.5 | 0 | 42.35 | 3.4 | 1000 | 1230.63 |
| gpt-4o-mini-2024-07-18 | greedy | single | 1 | 20.1 | 58.75 | 4.64 | 0 | 0 | 41.26 | 0.1 | 1000 | 943.52 |
| gemini-1.5-flash | greedy | single | 1 | 19.4 | 55 | 6.43 | 0 | 0 | 31.77 | 22.7 | 1000 | 1538.18 |
| gemini-1.5-pro | greedy | single | 1 | 19.4 | 52.19 | 9.64 | 0 | 0 | 44.59 | 0.8 | 1000 | 1336.17 |
| yi-large-preview | greedy | single | 1 | 18.9 | 53.75 | 6.07 | 0 | 0 | 42.61 | 1.4 | 1000 | 833.36 |
| yi-large | greedy | single | 1 | 18.8 | 54.37 | 5 | 0 | 0 | 39.83 | 1.8 | 1000 | 757.01 |
| claude-3-5-haiku-20241022 | greedy | single | 1 | 18.7 | 53.12 | 6.07 | 0 | 0 | 43.22 | 0.1 | 1000 | 660.91 |
| claude-3-sonnet-20240229 | greedy | single | 1 | 18.7 | 54.06 | 4.29 | 1 | 0 | 43.66 | 0 | 1000 | 1095.37 |
| Meta-Llama-3-70B-Instruct | greedy | single | 1 | 16.8 | 48.44 | 4.64 | 0 | 0 | 42.31 | 0.2 | 1000 | 809.95 |
| Athene-70B | greedy | single | 1 | 16.7 | 48.75 | 3.93 | 0 | 0 | 32.98 | 21.1 | 1000 | 391.19 |
| gemma-2-27b-it | greedy | single | 1 | 16.3 | 46.56 | 5 | 0 | 0 | 41.18 | 1.1 | 1000 | 1014.56 |
| claude-3-haiku-20240307 | greedy | single | 1 | 14.3 | 43.75 | 1.07 | 0 | 0 | 37.87 | 0.1 | 1000 | 1015.06 |
| command-r-plus | greedy | single | 1 | 13.9 | 40.94 | 2.86 | 0 | 0 | 39.01 | 0.2 | 1000 | 810.53 |
| reka-core-20240501 | greedy | single | 1 | 13 | 39.38 | 1.43 | 0 | 0 | 33.88 | 4 | 1000 | 1078.29 |
| gemma-2-9b-it | greedy | single | 1 | 12.8 | 37.81 | 2.5 | 0 | 0 | 36.79 | 0 | 1000 | 849.84 |
| Meta-Llama-3.1-8B-Instruct | greedy | single | 1 | 12.8 | 39.38 | 0.71 | 0 | 0 | 13.68 | 61.5 | 1000 | 1043.9 |
| Qwen2.5-7B-Instruct | greedy | single | 1 | 12 | 36.25 | 1.43 | 0 | 0 | 30.67 | 9.5 | 1000 | 850.93 |
| Meta-Llama-3-8B-Instruct | greedy | single | 1 | 11.9 | 36.88 | 0.36 | 0 | 0 | 23.7 | 29.2 | 1000 | 1216.4 |
| Mistral-Nemo-Instruct-2407 | greedy | single | 1 | 11.8 | 35.31 | 1.79 | 0 | 0 | 34.93 | 1.6 | 1000 | 925.88 |
| Phi-3-mini-4k-instruct | greedy | single | 1 | 11.6 | 35.94 | 0.36 | 0 | 0 | 13.5 | 59 | 1000 | 790.29 |
| Yi-1.5-34B-Chat | greedy | single | 1 | 11.5 | 35 | 1.07 | 0 | 0 | 32.73 | 4.4 | 1000 | 869.65 |
| gpt-3.5-turbo-0125 | greedy | single | 1 | 10.1 | 30.31 | 1.07 | 0.5 | 0 | 33.06 | 0.1 | 1000 | 820.66 |
| command-r | greedy | single | 1 | 9.9 | 30.31 | 0.71 | 0 | 0 | 32.66 | 1.5 | 1000 | 1005.17 |
| reka-flash-20240226 | greedy | single | 1 | 9.3 | 28.44 | 0.71 | 0 | 0 | 25.67 | 18.7 | 1000 | 1074.8 |
| mathstral-7B-v0.1 | greedy | single | 1 | 9 | 27.19 | 1.07 | 0 | 0 | 20.42 | 36 | 1000 | 1148.16 |
| Mixtral-8x7B-Instruct-v0.1 | greedy | single | 1 | 8.7 | 26.25 | 1.07 | 0 | 0 | 26.47 | 20.3 | 1000 | 1177.21 |
| Qwen2-7B-Instruct | greedy | single | 1 | 8.4 | 26.25 | 0 | 0 | 0 | 22.06 | 24.4 | 1000 | 1473.23 |
| Llama-3.2-3B-Instruct@together | greedy | single | 1 | 7.4 | 23.12 | 0 | 0 | 0 | 13.14 | 54.5 | 1000 | 963.47 |
| Phi-3.5-mini-instruct | greedy | single | 1 | 6.4 | 19.38 | 0.71 | 0 | 0 | 5.98 | 80.6 | 1000 | 718.43 |
| Qwen2.5-3B-Instruct | greedy | single | 1 | 4.8 | 15 | 0 | 0 | 0 | 11.44 | 56.7 | 1000 | 906.58 |
| gemma-2-2b-it | greedy | single | 1 | 4.2 | 13.12 | 0 | 0 | 0 | 9.97 | 57.2 | 1000 | 1032.89 |
| Yi-1.5-9B-Chat | greedy | single | 1 | 2.3 | 7.19 | 0 | 0 | 0 | 7.53 | 11.3 | 1000 | 1592.6 |