I ran a zero-shot evaluation of LLMs on MalayMMLU (first-token accuracy) using:
```shell
python src/evaluate.py --by_letter --shot 0 --task=MalayMMLU --base_model=Model --output_folder=output/
python src/calculate_accuracies.py --pred_files File --shot=0 --output_dir=output/
```
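For context, here is a minimal sketch of how a first-token accuracy metric can be computed. This is an assumption about the approach, suggested by the `--by_letter` flag, not the actual `src/evaluate.py` implementation: the model's score for the first generated token is compared across the answer-letter options, and the highest-scoring letter is taken as the prediction. All names here are hypothetical.

```python
# Hypothetical sketch of first-token accuracy (not the repository's code):
# pick the answer letter whose first-token score is highest, then compare
# against the gold letter.

def first_token_prediction(letter_scores: dict) -> str:
    """Return the option letter with the highest first-token score."""
    return max(letter_scores, key=letter_scores.get)

def first_token_accuracy(examples: list) -> float:
    """examples: list of (letter_scores, gold_letter) pairs; returns percent."""
    correct = sum(
        first_token_prediction(scores) == gold for scores, gold in examples
    )
    return 100.0 * correct / len(examples)
```

For example, `first_token_accuracy([({"A": 2.1, "B": 0.3, "C": -1.0, "D": 0.5}, "A")])` returns `100.0`.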
I will keep posting evaluations in this issue.
**gemma-2-2b-it** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 60.376586 |
| Language | 6288 | 61.275445 |
| Social science | 6918 | 57.386528 |
| Others | 4169 | 57.783641 |
| Humanities | 4395 | 60.819113 |
| **Average** | **24213** | **59.389584** |
**Qwen2.5-7B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 70.568973 |
| Language | 6288 | 68.034351 |
| Social science | 6918 | 63.486557 |
| Others | 4169 | 64.140082 |
| Humanities | 4395 | 69.124005 |
| **Average** | **24213** | **66.517986** |
**Qwen2.5-3B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 55.055260 |
| Language | 6288 | 60.496183 |
| Social science | 6918 | 49.494073 |
| Others | 4169 | 50.683617 |
| Humanities | 4395 | 57.610922 |
| **Average** | **24213** | **54.590509** |
**Qwen2.5-1.5B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 57.224724 |
| Language | 6288 | 52.910305 |
| Social science | 6918 | 51.691240 |
| Others | 4169 | 52.578556 |
| Humanities | 4395 | 57.315131 |
| **Average** | **24213** | **53.739727** |
**Qwen2.5-0.5B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 48.260336 |
| Language | 6288 | 45.038168 |
| Social science | 6918 | 45.706852 |
| Others | 4169 | 46.725834 |
| Humanities | 4395 | 50.079636 |
| **Average** | **24213** | **46.760005** |
**Llama-3.2-3B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 56.692591 |
| Language | 6288 | 58.460560 |
| Social science | 6918 | 54.018502 |
| Others | 4169 | 52.842408 |
| Humanities | 4395 | 60.455063 |
| **Average** | **24213** | **56.407715** |
**Llama-3.2-1B-Instruct** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 40.073680 |
| Language | 6288 | 38.692748 |
| Social science | 6918 | 39.751373 |
| Others | 4169 | 39.242024 |
| Humanities | 4395 | 43.367463 |
| **Average** | **24213** | **40.077644** |
**mallam-small-2.0** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 62.949652 |
| Language | 6288 | 72.642494 |
| Social science | 6918 | 66.657415 |
| Others | 4169 | 63.261693 |
| Humanities | 4395 | 67.506257 |
| **Average** | **24213** | **66.603502** |
**mallam-tiny-2.0** (metric: first, shot: 0shot, by_letter: True)

| Category | # Questions | Accuracy (%) |
|---|---|---|
| STEM | 2443 | 55.540729 |
| Language | 6288 | 61.828244 |
| Social science | 6918 | 56.799075 |
| Others | 4169 | 55.058287 |
| Humanities | 4395 | 57.472127 |
| **Average** | **24213** | **57.339693** |
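The reported average accuracy is consistent with a question-count-weighted mean of the per-category accuracies, using the category sizes printed in each run. A quick sanity check with the gemma-2-2b-it numbers above:

```python
# Category sizes and per-category accuracies for gemma-2-2b-it,
# copied from the evaluation output above.
counts = {"STEM": 2443, "Language": 6288, "Social science": 6918,
          "Others": 4169, "Humanities": 4395}
acc = {"STEM": 60.37658616455178, "Language": 61.275445292620866,
       "Social science": 57.38652789823648, "Others": 57.78364116094987,
       "Humanities": 60.81911262798635}

# Weight each category's accuracy by its number of questions.
weighted_avg = sum(acc[c] * counts[c] for c in counts) / sum(counts.values())
print(weighted_avg)  # ~59.389584, matching the reported average
```

In other words, the average is computed over all 24213 questions, not as a plain mean of the five category scores.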
Models evaluated so far:

- Claude 3.5 Sonnet (Amazon Bedrock Edition)
- google/gemma-2-27b-it
- google/gemma-2-2b-it
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-3B-Instruct
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2.5-0.5B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- Mesolitica MaLLaM 2.5 Small
- Mesolitica MaLLaM 2.5 Tiny
- Mesolitica MaLLaM 2.0 Small
- Mesolitica MaLLaM 2.0 Tiny