SWIFT provides an eval (evaluation) capability that produces standardized evaluation metrics for both original and trained models.
SWIFT's eval capability is built on the EvalScope evaluation framework from the ModelScope community, wrapped at a higher level to cover the evaluation needs of a wide range of models.
Note: EvalScope supports many other, more advanced capabilities, such as model inference performance evaluation; for those, please use the EvalScope framework directly.
Currently, both standard evaluation datasets and user-defined evaluation datasets are supported. Evaluation on the standard datasets is provided by three evaluation backends:
The supported dataset names are listed below; for detailed information on each dataset, please refer to the list of all supported datasets.
- Native (default): Primarily supports pure text evaluation, and supports visualization of evaluation results.
  Datasets: 'arc', 'bbh', 'ceval', 'cmmlu', 'competition_math', 'general_qa', 'gpqa', 'gsm8k', 'hellaswag', 'humaneval', 'ifeval', 'iquiz', 'mmlu', 'mmlu_pro', 'race', 'trivia_qa', 'truthful_qa'
- OpenCompass: Primarily supports pure text evaluation; visualization of evaluation results is not currently supported.
  Datasets: 'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
- VLMEvalKit: Primarily supports multimodal evaluation; visualization of evaluation results is not currently supported.
  Datasets: 'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL', 'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI', 'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST', 'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500', 'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL', 'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME'
Before running an evaluation, install the dependencies:
pip install ms-swift[eval] -U
Or install from source:
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e '.[eval]'
Four evaluation methods are supported: pure text evaluation, multimodal evaluation, URL evaluation (evaluating a deployed model service), and custom dataset evaluation.
Basic Example
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--eval_backend Native \
--infer_backend pt \
--eval_limit 10 \
--eval_dataset gsm8k
Where:
- model: Can be a local model path or a model ID on ModelScope
- eval_backend: Options are Native, OpenCompass, VLMEvalKit; default is Native
- infer_backend: Options are pt, vllm, lmdeploy; default is pt
- eval_limit: Number of samples drawn from each evaluation set; default is None, which means using all data; useful for quick validation
- eval_dataset: Evaluation dataset(s); multiple datasets can be given, separated by spaces (see the example after this list)
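For instance, a quick sanity check across several text benchmarks at once might look like the following (a minimal sketch; the dataset names are taken from the Native backend list above):

```bash
# Evaluate 10 samples from each of three Native-backend datasets
CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --eval_backend Native \
    --infer_backend pt \
    --eval_limit 10 \
    --eval_dataset gsm8k arc mmlu
```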
For the full list of evaluation parameters, please refer to the evaluation parameters documentation.
More evaluation examples can be found in the examples directory of the repository.
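As a further illustration, a multimodal evaluation through the VLMEvalKit backend follows the same pattern (a sketch; `Qwen/Qwen2-VL-2B-Instruct` is only an example of a multimodal model, and `MME` is taken from the VLMEvalKit dataset list above):

```bash
# Multimodal evaluation via the VLMEvalKit backend
CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --eval_backend VLMEvalKit \
    --infer_backend pt \
    --eval_limit 10 \
    --eval_dataset MME
```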
For custom dataset evaluation, two predefined dataset formats are supported: multiple-choice questions (MCQ) and question answering (QA). The usage process is as follows:
Note: When using custom dataset evaluation, the `eval_backend` parameter must be set to `Native`.
MCQ
This format is suitable for multiple-choice question scenarios; the evaluation metric is accuracy.
Data Preparation
Prepare CSV files in the multiple-choice question format, with the dataset directory structured as follows:
mcq/
├── example_dev.csv # (Optional) The filename should follow the format `{subset_name}_dev.csv` for few-shot evaluation
└── example_val.csv # The filename should follow the format `{subset_name}_val.csv` for the actual evaluation data
The CSV file should follow this format:
id,question,A,B,C,D,answer
1,Generally speaking, the amino acids that make up animal proteins are____,4 types,22 types,20 types,19 types,C
2,Among the substances present in the blood, which is not a metabolic end product?____,Urea,Uric acid,Pyruvate,Carbon dioxide,C
Where:
- `id` is an optional index
- `question` is the question
- `A`, `B`, `C`, `D`, etc. are the options (up to 10 options are supported)
- `answer` is the correct option
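A minimal sketch of assembling such a dataset from the shell (the path, subset name, and questions are purely illustrative):

```bash
# Create an illustrative MCQ dataset with subset name "example"
mkdir -p /path/to/mcq
cat > /path/to/mcq/example_val.csv <<'EOF'
id,question,A,B,C,D,answer
1,Which gas makes up most of the Earth's atmosphere?____,Oxygen,Nitrogen,Carbon dioxide,Argon,B
2,How many sides does a hexagon have?____,5,6,7,8,B
EOF
# Optionally add example_dev.csv in the same format to provide few-shot examples
```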
Launching Evaluation
Run the following command:
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--eval_backend Native \
--infer_backend pt \
--eval_dataset general_mcq \
--dataset_args '{"general_mcq": {"local_path": "/path/to/mcq", "subset_list": ["example"]}}'
Where:
- `eval_dataset` should be set to `general_mcq`
- `dataset_args` should be set with:
  - `local_path`: the path to the custom dataset folder
  - `subset_list`: the name(s) of the evaluation subset(s), i.e. the `{subset_name}` part of the `{subset_name}_dev.csv` filename mentioned above
Running Results
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=====================+=============+=================+==========+=======+=========+=========+
| Qwen2-0.5B-Instruct | general_mcq | AverageAccuracy | example | 12 | 0.5833 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
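If you also provide the optional `{subset_name}_dev.csv`, few-shot evaluation can be requested through `dataset_args`. The `few_shot_num` key below is an EvalScope dataset argument and is an assumption on our part rather than something shown above; please verify the exact key names against the EvalScope documentation:

```bash
# 3-shot MCQ evaluation; few_shot_num is assumed to be an accepted dataset argument (check EvalScope docs)
CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --eval_backend Native \
    --infer_backend pt \
    --eval_dataset general_mcq \
    --dataset_args '{"general_mcq": {"local_path": "/path/to/mcq", "subset_list": ["example"], "few_shot_num": 3}}'
```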
QA
This format is suitable for question-and-answer scenarios; the evaluation metrics are `ROUGE` and `BLEU`.
Data Preparation
Prepare a JSON Lines file in the question-and-answer format; the dataset directory contains a single file, structured as follows:
qa/
└── example.jsonl
The JSON Lines file should follow this format:
{"query": "What is the capital of China?", "response": "The capital of China is Beijing"}
{"query": "What is the highest mountain in the world?", "response": "It is Mount Everest"}
{"query": "Why can't penguins be seen in the Arctic?", "response": "Because most penguins live in Antarctica"}
Launching Evaluation
Run the following command:
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--eval_backend Native \
--infer_backend pt \
--eval_dataset general_qa \
--dataset_args '{"general_qa": {"local_path": "/path/to/qa", "subset_list": ["example"]}}'
Where:
- `eval_dataset` should be set to `general_qa`
- `dataset_args` is a JSON string that should be set with:
  - `local_path`: the path to the custom dataset folder
  - `subset_list`: the name of the evaluation subset, i.e. the name of the `*.jsonl` file above without the extension
Running Results
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=====================+=============+=================+==========+=======+=========+=========+
| Qwen2-0.5B-Instruct | general_qa | bleu-1 | default | 12 | 0.2324 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | bleu-2 | default | 12 | 0.1451 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | bleu-3 | default | 12 | 0.0625 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | bleu-4 | default | 12 | 0.0556 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-1-f | default | 12 | 0.3441 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-1-p | default | 12 | 0.2393 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-1-r | default | 12 | 0.8889 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-2-f | default | 12 | 0.2062 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-2-p | default | 12 | 0.1453 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-2-r | default | 12 | 0.6167 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-l-f | default | 12 | 0.333 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-l-p | default | 12 | 0.2324 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa | rouge-l-r | default | 12 | 0.8889 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
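The same command also works for models you have trained with SWIFT: as noted for the `model` parameter above, a local checkpoint path can be passed instead of a ModelScope ID. A sketch (the path is illustrative and assumes a full-parameter checkpoint saved as a complete model directory; LoRA checkpoints may first need to be merged or loaded as adapters):

```bash
# Evaluate a locally trained checkpoint (illustrative path) on the custom QA set above
CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model /path/to/output/checkpoint-xxx \
    --eval_backend Native \
    --infer_backend pt \
    --eval_dataset general_qa \
    --dataset_args '{"general_qa": {"local_path": "/path/to/qa", "subset_list": ["example"]}}'
```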