Benchmark correlations (%) with Chatbot Arena Elo, against the total costs of evaluating a single GPT-3.5-Turbo-0125 model. MixEval and MixEval-Hard show the highest correlations with Arena Elo and Arena Elo (En) among leading benchmarks. We reference the crowdsourcing price for Amazon Mechanical Turk ($0.05 per vote) when estimating the cost of evaluating a single model on Chatbot Arena (approximately $2,936). Chatbot Arena is prohibitively expensive, while MixEval and MixEval-Hard are cheap and cost-effective alternatives. For more details, please refer to our paper.

MixEval

We introduce MixEval, a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It ranks LLMs accurately (0.96 correlation with Chatbot Arena) while running locally and quickly (6% of the time and cost of running MMLU), and its queries are updated stably and with little effort every month to avoid contamination.

MixEval consists of two benchmarks, MixEval and MixEval-Hard, both updated periodically with our fast, stable pipeline. Each contains two splits: free-form and multiple-choice. Their relationship is shown below:

 MixEval (dynamic)
    │
    ├── MixEval
    │   ├──free-form.json
    │   └──multiple-choice.json
    │
    └── MixEval-Hard
        ├──free-form.json
        └──multiple-choice.json
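As a rough illustration of the layout above, the sketch below loads both split files from one benchmark directory. It assumes only that each split is a JSON document; `load_splits` and `benchmark_dir` are hypothetical names for this example, and the actual location and schema of the released files should be taken from the repository.

```python
import json
from pathlib import Path

# File names taken from the tree above.
SPLIT_FILES = ("free-form.json", "multiple-choice.json")


def load_splits(benchmark_dir):
    """Load both split files from one benchmark directory.

    Returns a dict keyed by file stem ("free-form", "multiple-choice").
    The directory argument is an assumption of this sketch.
    """
    splits = {}
    for file_name in SPLIT_FILES:
        path = Path(benchmark_dir) / file_name
        with path.open(encoding="utf-8") as handle:
            splits[path.stem] = json.load(handle)
    return splits
```

Calling `load_splits` once per benchmark directory (MixEval and MixEval-Hard) yields the four files shown in the tree.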

See our homepage and paper for more details!

Click-and-Go LLM Evaluation Suite

This repository hosts the evaluation code and dynamic data release for MixEval. The current dynamic benchmark version is displayed at the top of this page. We offer a reliable click-and-go evaluation suite compatible with both open-source and proprietary models, which includes model response generation and score computation. Additionally, this evaluation suite facilitates straightforward registration of custom models and benchmark data.

As demonstrated in the paper, traditional rule-based parsers exhibit significant instability and are prone to considerable errors. We therefore use either GPT-3.5-Turbo or open-source models as the model parser, an approach that has proven stable in our study and in this study.

ATTENTION❗ Feel free to use your own evaluation code to evaluate with MixEval data. We provide the guidelines here.

Quick Start

(Step 1) Clone the repo and set up the environment:

git clone https://github.com/Psycoy/MixEval.git
cd MixEval
conda create -n MixEval python=3.11 --yes
conda activate MixEval
bash setup.sh

# setup done

(Step 2) Set up the OpenAI API key for the model parser. Add the line below to the .env file:

MODEL_PARSER_API=<your openai api key>

The values in Leaderboard use GPT-3.5-Turbo-0125 as the default model parser. Open-source model parsers are also supported.
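Before launching a run, it can help to confirm that the key is actually present in .env. The following is a minimal sketch using only the standard library; `read_env_file` and `has_parser_key` are hypothetical helpers written for this example and are not part of the evaluation suite, and the check is independent of however the suite itself reads the file.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_parser_key(path=".env"):
    """Return True if MODEL_PARSER_API is set to a non-placeholder value."""
    key = read_env_file(path).get("MODEL_PARSER_API", "")
    # A value still starting with "<" is treated as the unedited placeholder.
    return bool(key) and not key.startswith("<")
```

Running `has_parser_key()` from the repository root returns False if the file is missing, the key is absent, or the placeholder text was left in place.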

(Step 3) Run evaluation and get results. That's all!

python -m mix_eval.evaluate \
    --model_name gemma_11_7b_instruct \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --max_gpu_memory 5GiB \
    --output_dir mix_eval/data/model_responses/ \
    --api_parallel_num 20

If you want to evaluate models that are not included in mix_eval.models.__init__, see here for the simple steps to register a new model.

This command will run both inference and score computation. If you want to run model inference only, check here; if you want to run score computation only, check here.

Model response files and scores will be saved to <output_folder>/<model_name>/<benchmark>/<version>/, and in this case, it's mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/. We take the overall score as the reported score in Leaderboard.
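The directory pattern above can be reproduced in a few lines, which is handy when post-processing results from a script. This is only a sketch of the stated pattern; `response_dir` is a hypothetical helper and not a function of the evaluation suite.

```python
from pathlib import Path


def response_dir(output_folder, model_name, benchmark, version):
    """Build <output_folder>/<model_name>/<benchmark>/<version>/ as a Path."""
    return Path(output_folder) / model_name / benchmark / version


run_dir = response_dir(
    "mix_eval/data/model_responses/",
    "gemma_11_7b_instruct",
    "mixeval_hard",
    "2024-06-01",
)
# Prints: mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01
print(run_dir.as_posix())
```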

ATTENTION❗ It's important to read the essential configurations here before running the evaluation.

Registering New Models

(Step 1) Add your model file to mix_eval/models/ with the name your_model_name.py and write the model class in it with the name Model_Class_Name.

Open-source chat models inherit from mix_eval.models.base.ChatModel (example file: llama_3_8b_instruct.py).
Open-source base models inherit from mix_eval.models.base.BaseModel (example file: llama_3_8b.py).
Proprietary models inherit from mix_eval.models.base_api.APIModelBase (example file: gpt_4_turbo_2024_04_09.py; add your API key to .env).
In most cases, all you need to do is write a simple model class with a single __init__ function. However, if your model needs more setup, e.g., it requires a different build_model() function, you should override the corresponding function or variable of the parent model.
The model file name should be the same as the name you pass to the @register_model() decorator on top of the model class.

(Step 2) Add your model to mix_eval.models.__init__.AVAILABLE_MODELS.

The entry you add should be in the form of your_model_name: Model_Class_Name. See other models in AVAILABLE_MODELS as a reference.
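The link between the file name, the decorator argument, and the AVAILABLE_MODELS entry can be illustrated with a self-contained toy registry. Everything below is a stand-in written for this illustration: `register_model`, `MODEL_REGISTRY`, and `ChatModel` are simplified and do not reproduce the suite's actual classes or signatures, the `model_name` attribute is hypothetical, and storing the class name as a string in `AVAILABLE_MODELS` is an assumption. Consult the example files named above for the real parent-class interface.

```python
MODEL_REGISTRY = {}  # toy stand-in for the suite's registry


def register_model(name):
    """Toy decorator: record the decorated class under the given model name."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


class ChatModel:
    """Toy parent class; the real one is provided by the suite."""
    def __init__(self, args):
        self.args = args


# In the real layout this class would live in a file named your_model_name.py,
# matching the name passed to the decorator.
@register_model("your_model_name")
class Model_Class_Name(ChatModel):
    def __init__(self, args):
        super().__init__(args)
        # Hypothetical attribute: the checkpoint this model would load.
        self.model_name = "org/your-model-checkpoint"


# Entry in the form your_model_name: Model_Class_Name
# (class name stored as a string here, which is an assumption).
AVAILABLE_MODELS = {"your_model_name": "Model_Class_Name"}

# Sanity check: every listed model resolves to a registered class of that name.
for name, class_name in AVAILABLE_MODELS.items():
    assert MODEL_REGISTRY[name].__name__ == class_name
```

The point of the sketch is the naming contract: one string ties together the file name, the decorator argument, and the dictionary key, so a typo in any of the three would make the lookup fail.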

Only Performing Model Inference

Sometimes you may want to do model inference without computing the scores. You can achieve this by setting the --inference_only flag when running the mix_eval.evaluate module:

python -m mix_eval.evaluate \
    --model_name gemma_11_7b_instruct \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --max_gpu_memory 5GiB \
    --output_dir mix_eval/data/model_responses/ \
    --inference_only

Model response files will be saved to <output_folder>/<model_name>/<benchmark>/<version>/, and in this example it's mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/.

ATTENTION❗ It's important to read the essential configurations here before running the evaluation.

You can check whether the model response files are complete after running the inference:

python -m mix_eval.utils.check_eval_complete \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --chat_models_to_check \
    gpt_4o \
    llama_3_70b_instruct \
    claude_3_opus \
    --base_models_to_check \
    none \
    --model_response_dir mix_eval/data/model_responses/ \
    --out_path mix_eval/data/model_responses/eval_checks.log

The checking results will be written to --out_path; only problematic files will be recorded.

Only Computing Scores

If you want to compute the scores separately:

(Step 1) Prepare your model response files. You can use either our evaluation suite (refer to here) or your own (refer to the example response file formats and protocols specified here).

(Step 2) Run the score computation script:

python -m mix_eval.compute_metrics \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --model_response_dir mix_eval/data/model_responses/ \
    --api_parallel_num 20 \
    --models_to_eval \
    gemma_11_7b_instruct \
    gpt_4o \
    claude_3_opus

Set --api_parallel_num according to your OpenAI user tier to avoid rate limits. In general, if you are a Tier-5 user, you can set --api_parallel_num to 100 or more to parse results in 30 seconds.

If you are parsing base models' responses, set the --extract_base_model_response flag to retain only the meaningful part of each response, which gives more stable parsing results.

If you finished model parsing some time ago and now want to display the results again, add the --compute_score_from_judged_file flag to avoid calling the model parser API again and save your budget. Make sure that parsed files named judge_results_ff_model_judge_gpt-3.5-turbo-0125 and judge_results_mp_model_judge_gpt-3.5-turbo-0125 exist under the target model response folder, where gpt-3.5-turbo-0125 denotes the model parser name, ff denotes free-form, and mp denotes multiple-choice.
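A quick pre-flight check for the judged files can save a failed run. The sketch below builds the two expected names from the naming scheme described above and reports which ones are missing from a folder. `judged_file_stems` and `missing_judged_files` are hypothetical helpers, not part of the suite, and because the file extension is not stated here, the check matches on the name prefix only.

```python
from pathlib import Path


def judged_file_stems(parser_name="gpt-3.5-turbo-0125"):
    """Expected name stems for the free-form (ff) and multiple-choice (mp) files."""
    return [
        f"judge_results_{split}_model_judge_{parser_name}"
        for split in ("ff", "mp")
    ]


def missing_judged_files(model_dir, parser_name="gpt-3.5-turbo-0125"):
    """Return the stems with no matching file in model_dir (any extension)."""
    folder = Path(model_dir)
    # A missing folder simply means every expected file is missing.
    existing = [p.name for p in folder.iterdir()] if folder.is_dir() else []
    return [
        stem
        for stem in judged_file_stems(parser_name)
        if not any(name.startswith(stem) for name in existing)
    ]
```

An empty list from `missing_judged_files(...)` suggests that both judged files are in place for that model parser.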

What is MixEval?

See our homepage and paper for more details!

MixEval is an approach that bridges the gap between real-world user queries and efficient, reproducible evaluation by leveraging user queries mined from the web and matching them with similar queries from existing benchmarks. MixEval is also the proposed benchmark built with this approach.

MixEval-Hard is the hard version of MixEval, designed to enhance the benchmark's ability to distinguish strong models. It is sampled from MixEval based on model evaluation results, with a higher probability of selecting harder queries. To address distribution deviation, we introduce a rejection sampling process to ensure that the distribution of MixEval-Hard aligns with that of wild queries.

Dynamic evaluation is introduced to mitigate the contamination issue. We periodically update the data points in MixEval and MixEval-Hard using our fast, stable pipeline, which performs benchmark mixture with a different batch of wild queries from the same distribution, showing low variance (0.36 Std. on a 0-100 scale) and significant version difference (85% unique query ratio).

Why Use MixEval Benchmarks?

MixEval offers five significant advantages for practitioners:

Accurate model ranking, demonstrated by a 0.96 correlation with Chatbot Arena.
Fast, cheap, and reproducible execution, requiring only 6% of the time and cost of MMLU and with no dependence on human input.
Dynamic benchmarking enabled by a low-effort and stable updating mechanism.
A comprehensive and less biased query distribution, as it bases queries on a large-scale web corpus.
A fair grading process, ensured by the ground-truth-based grading mechanism.

How Effective is MixEval as a Benchmark Mixture Approach?

MixEval is effective because:

MixEval and MixEval-Hard achieve the highest correlation with Arena Elo and Arena Elo (En) among all benchmarks.
MixEval improves the correlation with Arena Elo and Arena Elo (En) across all its main benchmark splits.
MixEval outperforms both benchmark-level and uniform mixtures.
MixEval effectively maps real-world user queries to ground-truth-based benchmarks.

🦾 Contribute

Feel free to hit the ⭐star button or 🦾contribute! We review new issues and PRs regularly and will acknowledge your contributions!

📑 Citation

If you found this repository useful, please consider 📑citing:

@article{ni2024mixeval,
        title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
        author={Jinjie Ni and Fuzhao Xue and Xiang Yue and Yuntian Deng and Mahir Shah and Kabir Jain and Graham Neubig and Yang You},
        journal={arXiv preprint arXiv:[placeholder]},
        year={2024}
      }
