Evaluation

SWIFT supports an eval (evaluation) capability that provides standardized evaluation metrics for both original and trained models.

Capability Introduction

SWIFT's eval capability is built on the EvalScope evaluation framework from the ModelScope community, adding a higher-level wrapper on top of it to cover the evaluation needs of various models.

Note: EvalScope also supports many other advanced capabilities, such as model performance evaluation; to use them, please work with the EvalScope framework directly.

Currently, both standard evaluation datasets and user-defined (custom) evaluation datasets are supported. The standard datasets are provided by three evaluation backends:

The names of the supported datasets are listed below; for detailed information on each dataset, please refer to all supported datasets.

  1. Native (default):

    Primarily supports text-only evaluation, and visualization of evaluation results is supported.

    'arc', 'bbh', 'ceval', 'cmmlu', 'competition_math',
    'general_qa', 'gpqa', 'gsm8k', 'hellaswag', 'humaneval',
    'ifeval', 'iquiz', 'mmlu', 'mmlu_pro',
    'race', 'trivia_qa', 'truthful_qa'
    
  2. OpenCompass:

    Primarily supports text-only evaluation; visualization of evaluation results is not currently supported.

    'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
    'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
    'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
    'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
    'ARC_e', 'COPA', 'ARC_c', 'DRCD'
    
  3. VLMEvalKit:

    Primarily supports multimodal evaluation and currently does not support visualization of evaluation results.

    'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
    'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
    'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
    'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
    'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
    'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
    'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
    'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
    'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
    'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
    'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME'
    

Environment Preparation

pip install ms-swift[eval] -U

Or install from source:

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e '.[eval]'

Evaluation

Four evaluation methods are supported: text-only evaluation, multimodal evaluation, URL evaluation, and custom-dataset evaluation.

Basic Example

CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --eval_backend Native \
    --infer_backend pt \
    --eval_limit 10 \
    --eval_dataset gsm8k

Where:

  • model: Can be a local model path or a model ID on ModelScope
  • eval_backend: Options are Native, OpenCompass, VLMEvalKit; default is Native
  • infer_backend: Options are pt, vllm, lmdeploy; default is pt
  • eval_limit: Sample size for each evaluation set; default is None, which means using all data; can be used for quick validation
  • eval_dataset: Evaluation dataset(s); multiple datasets can be set, separated by spaces

For the full list of evaluation parameters, please refer to here.
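
A multimodal evaluation follows the same pattern, only switching the backend and the datasets. Below is a minimal sketch; the model ID is just an illustrative multimodal model, and the command also shows passing several datasets separated by spaces:

CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --eval_backend VLMEvalKit \
    --infer_backend pt \
    --eval_limit 10 \
    --eval_dataset MME RealWorldQA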

More evaluation examples can be found in examples.
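
URL evaluation targets an already deployed, OpenAI-compatible service instead of a local model. A rough sketch is shown below; the endpoint is only a placeholder, and the exact parameter name (eval_url here) is an assumption that should be verified against the evaluation parameter list linked above:

swift eval \
    --eval_url http://127.0.0.1:8000/v1/chat/completions \
    --eval_backend Native \
    --eval_limit 10 \
    --eval_dataset gsm8k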

Custom Evaluation Datasets

This framework supports two predefined dataset formats: multiple-choice questions (MCQ) and question-and-answer (QA). The usage process is as follows:

Note: When using a custom evaluation, the eval_backend parameter must be set to Native.

Multiple-Choice Question Format (MCQ)

This format is suitable for multiple-choice question scenarios; the evaluation metric is accuracy.

Data Preparation

Prepare a CSV file in the multiple-choice question format, structured as follows:

mcq/
├── example_dev.csv  # (Optional) The filename should follow the format `{subset_name}_dev.csv` for few-shot evaluation
└── example_val.csv  # The filename should follow the format `{subset_name}_val.csv` for the actual evaluation data

The CSV file should follow this format:

id,question,A,B,C,D,answer
1,"Generally speaking, the amino acids that make up animal proteins are____",4 types,22 types,20 types,19 types,C
2,"Among the substances present in the blood, which is not a metabolic end product?____",Urea,Uric acid,Pyruvate,Carbon dioxide,C

Where:

  • id is an optional index
  • question is the question
  • A, B, C, D, etc. are the options, with a maximum of 10 options
  • answer is the correct option
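
As a quick way to reproduce this layout, the sample file can be written from the shell (the optional example_dev.csv for few-shot evaluation would be created the same way):

mkdir -p mcq
cat > mcq/example_val.csv <<'EOF'
id,question,A,B,C,D,answer
1,"Generally speaking, the amino acids that make up animal proteins are____",4 types,22 types,20 types,19 types,C
2,"Among the substances present in the blood, which is not a metabolic end product?____",Urea,Uric acid,Pyruvate,Carbon dioxide,C
EOF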

Launching Evaluation

Run the following command:

CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --eval_backend Native \
    --infer_backend pt \
    --eval_dataset general_mcq \
    --dataset_args '{"general_mcq": {"local_path": "/path/to/mcq", "subset_list": ["example"]}}'

Where:

  • eval_dataset should be set to general_mcq
  • dataset_args should be set with:
    • local_path as the path to the custom dataset folder
    • subset_list as the list of evaluation subset names, i.e. the {subset_name} part of the *_dev.csv / *_val.csv filenames above

Running Results

+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Model               | Dataset     | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+=============+=================+==========+=======+=========+=========+
| Qwen2-0.5B-Instruct | general_mcq | AverageAccuracy | example  |    12 |  0.5833 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+

Question-and-Answer Format (QA)

This format is suitable for question-and-answer scenarios; the evaluation metrics are ROUGE and BLEU.

Data Preparation

Prepare the question-and-answer data as a JSON Lines file, placed in a folder with the following structure:

qa/
└── example.jsonl

The JSON Lines file should follow this format:

{"query": "What is the capital of China?", "response": "The capital of China is Beijing"}
{"query": "What is the highest mountain in the world?", "response": "It is Mount Everest"}
{"query": "Why can't penguins be seen in the Arctic?", "response": "Because most penguins live in Antarctica"}

Launching Evaluation

Run the following command:

CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --eval_backend Native \
    --infer_backend pt \
    --eval_dataset general_qa \
    --dataset_args '{"general_qa": {"local_path": "/path/to/qa", "subset_list": ["example"]}}'

Where:

  • eval_dataset should be set to general_qa
  • dataset_args is a JSON string that needs to be set with:
    • local_path as the path to the custom dataset folder
    • subset_list as the evaluation subset name, i.e. the filename of the *.jsonl file above without its extension

Running Results

+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Model               | Dataset     | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+=============+=================+==========+=======+=========+=========+
| Qwen2-0.5B-Instruct | general_qa  | bleu-1          | default  |    12 |  0.2324 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | bleu-2          | default  |    12 |  0.1451 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | bleu-3          | default  |    12 |  0.0625 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | bleu-4          | default  |    12 |  0.0556 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-1-f       | default  |    12 |  0.3441 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-1-p       | default  |    12 |  0.2393 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-1-r       | default  |    12 |  0.8889 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-2-f       | default  |    12 |  0.2062 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-2-p       | default  |    12 |  0.1453 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-2-r       | default  |    12 |  0.6167 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-l-f       | default  |    12 |  0.333  | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-l-p       | default  |    12 |  0.2324 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+
| Qwen2-0.5B-Instruct | general_qa  | rouge-l-r       | default  |    12 |  0.8889 | default |
+---------------------+-------------+-----------------+----------+-------+---------+---------+