Not an issue, just provide additional benchmarks #1

Open
huseinzol05 opened this issue Oct 10, 2024 · 0 comments
huseinzol05 commented Oct 10, 2024

I ran the evaluation for "Zero-shot results of LLMs on MalayMMLU (First token accuracy)" using:

python src/evaluate.py --by_letter --shot 0 --task=MalayMMLU --base_model=Model --output_folder=output/
python src/calculate_accuracies.py --pred_files File --shot=0 --output_dir=output/

I will keep posting evaluations in this issue.
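
Below is a minimal sketch of what first-token (by-letter) scoring presumably does: compare the model's next-token logits across the option letters and pick the argmax. The prompt format, letter set, and helper name are illustrative assumptions, not the repo's actual code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # one of the models benchmarked below
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def first_token_letter(prompt, letters=("A", "B", "C", "D")):
    # Score only the first generated token: take the next-token logits
    # and compare them over each option letter's first token id.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    ids = [tok.encode(l, add_special_tokens=False)[0] for l in letters]
    scores = [logits[i].item() for i in ids]
    return letters[scores.index(max(scores))]

A prediction counts as correct when the returned letter matches the gold answer.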

Claude 3.5 Sonnet (Amazon Bedrock Edition)

google/gemma-2-27b-it

google/gemma-2-2b-it

           Model   Accuracy   shot by_letter        category
0  gemma-2-2b-it  60.376586  0shot      True            STEM
1  gemma-2-2b-it  61.275445  0shot      True        Language
2  gemma-2-2b-it  57.386528  0shot      True  Social science
3  gemma-2-2b-it  57.783641  0shot      True          Others
4  gemma-2-2b-it  60.819113  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : gemma-2-2b-it
Metric : first
Shot : 0shot
average accuracy 59.38958410771073
accuracy for STEM 60.37658616455178
accuracy for Language 61.275445292620866
accuracy for Social science 57.38652789823648
accuracy for Others 57.78364116094987
accuracy for Humanities 60.81911262798635
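
The reported average accuracy is consistent with a micro-average weighted by the category sizes printed above each summary; a quick sanity check using the gemma-2-2b-it numbers:

counts = {"Social science": 6918, "Language": 6288, "Humanities": 4395,
          "Others": 4169, "STEM": 2443}
acc = {"STEM": 60.37658616455178, "Language": 61.275445292620866,
       "Social science": 57.38652789823648, "Others": 57.78364116094987,
       "Humanities": 60.81911262798635}
avg = sum(acc[c] * counts[c] for c in counts) / sum(counts.values())
print(avg)  # ~59.3896, matching the "average accuracy" line above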

Qwen/Qwen2.5-14B-Instruct

Qwen/Qwen2.5-7B-Instruct

                 Model   Accuracy   shot by_letter        category
0  Qwen2.5-7B-Instruct  70.568973  0shot      True            STEM
1  Qwen2.5-7B-Instruct  68.034351  0shot      True        Language
2  Qwen2.5-7B-Instruct  63.486557  0shot      True  Social science
3  Qwen2.5-7B-Instruct  64.140082  0shot      True          Others
4  Qwen2.5-7B-Instruct  69.124005  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Qwen2.5-7B-Instruct
Metric : first
Shot : 0shot
average accuracy 66.51798620575724
accuracy for STEM 70.56897257470324
accuracy for Language 68.03435114503816
accuracy for Social science 63.486556808326114
accuracy for Others 64.14008155432957
accuracy for Humanities 69.1240045506257

Qwen/Qwen2.5-3B-Instruct

                 Model   Accuracy   shot by_letter        category
0  Qwen2.5-3B-Instruct  55.055260  0shot      True            STEM
1  Qwen2.5-3B-Instruct  60.496183  0shot      True        Language
2  Qwen2.5-3B-Instruct  49.494073  0shot      True  Social science
3  Qwen2.5-3B-Instruct  50.683617  0shot      True          Others
4  Qwen2.5-3B-Instruct  57.610922  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Qwen2.5-3B-Instruct
Metric : first
Shot : 0shot
average accuracy 54.59050923057861
accuracy for STEM 55.0552599263201
accuracy for Language 60.49618320610687
accuracy for Social science 49.49407343162764
accuracy for Others 50.68361717438235
accuracy for Humanities 57.61092150170648

Qwen/Qwen2.5-1.5B-Instruct

                   Model   Accuracy   shot by_letter        category
0  Qwen2.5-1.5B-Instruct  57.224724  0shot      True            STEM
1  Qwen2.5-1.5B-Instruct  52.910305  0shot      True        Language
2  Qwen2.5-1.5B-Instruct  51.691240  0shot      True  Social science
3  Qwen2.5-1.5B-Instruct  52.578556  0shot      True          Others
4  Qwen2.5-1.5B-Instruct  57.315131  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Qwen2.5-1.5B-Instruct
Metric : first
Shot : 0shot
average accuracy 53.73972659315244
accuracy for STEM 57.22472370036839
accuracy for Language 52.91030534351145
accuracy for Social science 51.69124024284475
accuracy for Others 52.57855600863517
accuracy for Humanities 57.31513083048919

Qwen/Qwen2.5-0.5B-Instruct

                   Model   Accuracy   shot by_letter        category
0  Qwen2.5-0.5B-Instruct  48.260336  0shot      True            STEM
1  Qwen2.5-0.5B-Instruct  45.038168  0shot      True        Language
2  Qwen2.5-0.5B-Instruct  45.706852  0shot      True  Social science
3  Qwen2.5-0.5B-Instruct  46.725834  0shot      True          Others
4  Qwen2.5-0.5B-Instruct  50.079636  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Qwen2.5-0.5B-Instruct
Metric : first
Shot : 0shot
average accuracy 46.760004956015365
accuracy for STEM 48.26033565288579
accuracy for Language 45.038167938931295
accuracy for Social science 45.70685169124024
accuracy for Others 46.7258335332214
accuracy for Humanities 50.07963594994311

meta-llama/Llama-3.2-3B-Instruct

                   Model   Accuracy   shot by_letter        category
0  Llama-3.2-3B-Instruct  56.692591  0shot      True            STEM
1  Llama-3.2-3B-Instruct  58.460560  0shot      True        Language
2  Llama-3.2-3B-Instruct  54.018502  0shot      True  Social science
3  Llama-3.2-3B-Instruct  52.842408  0shot      True          Others
4  Llama-3.2-3B-Instruct  60.455063  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Llama-3.2-3B-Instruct
Metric : first
Shot : 0shot
average accuracy 56.40771486391608
accuracy for STEM 56.69259107654523
accuracy for Language 58.460559796437664
accuracy for Social science 54.01850245735762
accuracy for Others 52.842408251379226
accuracy for Humanities 60.455062571103525

meta-llama/Llama-3.2-1B-Instruct

                   Model   Accuracy   shot by_letter        category
0  Llama-3.2-1B-Instruct  40.073680  0shot      True            STEM
1  Llama-3.2-1B-Instruct  38.692748  0shot      True        Language
2  Llama-3.2-1B-Instruct  39.751373  0shot      True  Social science
3  Llama-3.2-1B-Instruct  39.242024  0shot      True          Others
4  Llama-3.2-1B-Instruct  43.367463  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Llama-3.2-1B-Instruct
Metric : first
Shot : 0shot
average accuracy 40.07764424069714
accuracy for STEM 40.07367990176013
accuracy for Language 38.69274809160305
accuracy for Social science 39.751373229257005
accuracy for Others 39.24202446629887
accuracy for Humanities 43.3674630261661

Mesolitica MaLLaM 2.5 Small

Mesolitica MaLLaM 2.5 Tiny

Mesolitica MaLLaM 2.0 Small

              Model   Accuracy   shot by_letter        category
0  mallam-small-2.0  62.949652  0shot      True            STEM
1  mallam-small-2.0  72.642494  0shot      True        Language
2  mallam-small-2.0  66.657415  0shot      True  Social science
3  mallam-small-2.0  63.261693  0shot      True          Others
4  mallam-small-2.0  67.506257  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : mallam-small-2.0
Metric : first
Shot : 0shot
average accuracy 66.60350234116301
accuracy for STEM 62.94965206713058
accuracy for Language 72.64249363867684
accuracy for Social science 66.65741543798785
accuracy for Others 63.261693451667064
accuracy for Humanities 67.50625711035268

Mesolitica MaLLaM 2.0 Tiny

             Model   Accuracy   shot by_letter        category
0  mallam-tiny-2.0  55.540729  0shot      True            STEM
1  mallam-tiny-2.0  61.828244  0shot      True        Language
2  mallam-tiny-2.0  56.799075  0shot      True  Social science
3  mallam-tiny-2.0  55.058287  0shot      True          Others
4  mallam-tiny-2.0  57.472127  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : mallam-tiny-2.0
Metric : first
Shot : 0shot
average accuracy 57.33969250818039
accuracy for STEM 55.54072861236185
accuracy for Language 61.82824427480916
accuracy for Social science 56.799074877132114
accuracy for Others 55.05828735907892
accuracy for Humanities 57.472127417519914