Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

[arXiv]

Craft the null response

Run notebook_gpt4/gpt-4-1106-preview_vs_nil.ipynb to get the null response augmented with the adversarial string.
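Conceptually, the null model is trivial: it ignores the instruction entirely and always returns the same constant response, with the optimized adversarial string attached. A minimal Python sketch (ADV_RESPONSE below is a placeholder for the string the notebook produces):

ADV_RESPONSE = "<null response + adversarial string from the notebook>"  # placeholder

def null_model(instruction: str) -> str:
    # The output is independent of the input instruction.
    return ADV_RESPONSE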

Evaluation

Step 1: Prepare the submission

Run 01_prepare_submission.ipynb to craft the null model submission.
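The submission follows the model_outputs format that alpaca-eval expects: one record per benchmark instruction, each carrying the identical null response. A rough sketch of what the notebook assembles (the instruction-set filename here is illustrative):

import json

null_response = "..."  # the adversarial null response from the previous step

# Load the AlpacaEval instruction set (filename is illustrative).
with open("alpaca_eval_instructions.json") as f:
    instructions = json.load(f)

# Every instruction receives the same constant output.
outputs = [
    {
        "instruction": ex["instruction"],
        "output": null_response,
        "generator": "null_model",
    }
    for ex in instructions
]

with open("example/outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)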

Step 2: Evaluate the submission using alpaca-eval

To install the stable release of AlpacaEval 2.0, run

pip install alpaca-eval

Then you can use it to evaluate the submission as follows:

export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients, see client_configs/README.md 
alpaca_eval --model_outputs 'example/outputs.json' 
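
This prints the leaderboard entry for the submission and produces the per-example annotations that Step 3 reuses for further analysis.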

Step 3 (Optional): Re-evaluate the submission for further analysis

Run 02_re_evaluate_submission.ipynb to calculate the win rates from the annotations produced by alpaca-eval.

For example, the alpaca-eval annotations of our null model yield the following win rates:

{'win_rate': 76.91979180386511, 
'standard_error': 0.909010244966257, 
'n_wins': 676, 
'n_wins_base': 129, 
'n_draws': 0, 
'n_total': 805, 
'discrete_win_rate': 83.97515527950311,
'length_controlled_winrate': 86.45780691307944, 
'lc_standard_error': 0.1418000511342794}
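
For intuition, the discrete win rate is simply wins plus half the draws over the total: (676 + 0.5 * 0) / 805 ≈ 83.98%. Below is a hedged sketch of recomputing these counts from the annotations; the preference convention (values in [1, 2], with values above 1.5 favoring the submission) is an assumption, so check the notebook for the exact logic.

import pandas as pd

# Assumption: annotations.json holds per-example alpaca-eval annotations
# with a numeric 'preference' field in [1, 2], where values above 1.5
# favor the submission and values below 1.5 favor the baseline.
df = pd.read_json("annotations.json")

n_total = len(df)
n_wins = int((df["preference"] > 1.5).sum())
n_wins_base = int((df["preference"] < 1.5).sum())
n_draws = n_total - n_wins - n_wins_base

# Continuous win rate: mean preference rescaled to a percentage.
win_rate = (df["preference"] - 1).mean() * 100
# Discrete win rate: draws count as half a win.
discrete_win_rate = (n_wins + 0.5 * n_draws) / n_total * 100

print({"win_rate": round(win_rate, 2),
       "discrete_win_rate": round(discrete_win_rate, 2),
       "n_wins": n_wins, "n_wins_base": n_wins_base,
       "n_draws": n_draws, "n_total": n_total})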
