In this work, we present SailCompass, a comprehensive suite of evaluation scripts designed for robust and reproducible evaluation of multilingual language models targeting Southeast Asian languages.
SailCompass encompasses three major SEA languages and covers eight primary tasks using 14 datasets, spanning three task types: generation, multiple-choice questions, and classification.
Please refer to the SailCompass paper for more details.
We use OpenCompass to evaluate the models. To install the required packages, run the following commands under this folder:
conda create --name sailcompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate sailcompass
git clone https://github.com/sail-sg/sailcompass sailcompass
# initialize the git submodules
cd sailcompass
git submodule update --init --recursive
# clone OpenCompass and copy the SailCompass config
bash setup_environment.sh
# download the evaluation data from Hugging Face
mkdir data
python download_eval_data.py
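Optionally, you can verify that the evaluation data landed in the data/ folder; the exact file layout depends on the Hugging Face release and may differ:
# sanity check: the download script should have populated data/ (layout assumed)
ls data/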
To build the evaluation script, run the following command under this folder:
bash setup_sailcompass.sh
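As a quick sanity check, the SailCompass config used by the run command below should now be present inside the OpenCompass tree (the path mirrors the command in the next step and is otherwise an assumption):
# confirm the evaluation config was copied into OpenCompass
ls opencompass/configs/eval_sailcompass.py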
To launch the evaluation, run the following commands under this folder:
cd opencompass
python run.py configs/eval_sailcompass.py -w outputs/sailcompass --num-gpus 1 --max-num-workers 64 --debug
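After the run finishes, results are written under the work directory passed via -w. OpenCompass typically creates a timestamped run folder containing per-dataset predictions and a summary of aggregated scores; the exact layout below is an assumption and may vary across versions:
# list the timestamped run folders under the work directory (layout assumed)
ls outputs/sailcompass/
# print the aggregated per-dataset scores from the run summary
cat outputs/sailcompass/*/summary/summary_*.csv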
Thanks to the contributors of OpenCompass.
If you use the SailCompass benchmark in your work, please cite:
@misc{sailcompass,
  title={SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages},
  author={Jia Guo and Longxu Dou and Guangtao Zeng and Stanley Kok and Wei Lu and Qian Liu},
  year={2024},
}
If you have any questions, please raise an issue on our GitHub repository or contact [email protected].