Quick Start • LLM code • Tools • Citation • Acknowledgement
Important
Request for independent model evaluation is open!
Warning
To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:
- improves code benchmarks by adding up to thousands of new tests! (80x for HumanEval and 35x for MBPP!)
- crafts a set of utility tools to sanitize, visualize and inspect LLM-generated code and evaluation results!
- accelerates LLM4Code research by open-sourcing LLM-generated samples for 20+ models -- no need to re-run the expensive benchmarks!
Want to know more details? Please read our NeurIPS'23 paper!
To get started, please first set up the environment:
pip install evalplus --upgrade
Install nightly version:
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
Using EvalPlus as a local repo:
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (including the code) and save the samples to `samples.jsonl`:
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl
samples = [
dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
Structure of `problem`:
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs in the original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
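As a rough illustration of the generation step, the sketch below runs the same loop on a hand-written stand-in for the dataset, so it needs neither `evalplus` nor an LLM. The `Demo/0` task, the `gen_solution` helper and the temporary output path are all made up for this example; a real `GEN_SOLUTION` would call your model.

```python
import json
import os
import tempfile

# Stand-in for get_human_eval_plus(): a tiny dict with the same shape
# (task_id -> problem). The task itself is invented for illustration.
problems = {
    "Demo/0": {
        "task_id": "Demo/0",
        "entry_point": "add",
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    }
}


def gen_solution(prompt: str) -> str:
    """Hypothetical generator: a real one would send `prompt` to an LLM."""
    # Return prompt + body so that the solution is self-contained.
    return prompt + "    return a + b\n"


samples = [
    {"task_id": task_id, "solution": gen_solution(problem["prompt"])}
    for task_id, problem in problems.items()
]

# Write one JSON object per line. A temporary directory is used here so the
# sketch leaves nothing behind; in practice you would write ./samples.jsonl.
out_path = os.path.join(tempfile.mkdtemp(), "samples.jsonl")
with open(out_path, "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

Returning the prompt together with the generated body keeps each record usable as a `solution` without any further stitching.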
Note
Expected schema of `samples.jsonl`:
- `task_id`: Task ID, which is one of the keys of `get_[human_eval|mbpp]_plus()`
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: {"task_id": "HumanEval/?", "solution": "def f():\n return 1"}
- `completion` (optional): Function body without prompt
  - Example: {"task_id": "HumanEval/?", "completion": " return 1"}
Only one of `solution` and `completion` is required. If both are provided, `solution` will be used.
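The precedence rule can be sketched as a small helper. This is an illustration of the rule as stated, not the evaluator's actual code: `resolve_code` is a hypothetical name, and prepending the prompt to a `completion` is an assumption based on the description "function body without prompt".

```python
def resolve_code(sample: dict, prompt: str) -> str:
    """Pick the code to run for one record, preferring `solution`."""
    if "solution" in sample:
        # `solution` wins whenever it is present, even next to `completion`.
        return sample["solution"]
    if "completion" in sample:
        # Assumption: a completion is only a body, so the prompt goes first.
        return prompt + sample["completion"]
    raise ValueError(
        f"{sample.get('task_id')}: needs either 'solution' or 'completion'"
    )
```

Raising on a record that has neither field surfaces malformed samples before any evaluation time is spent on them.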
We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}` where `${SAMPLE_DIR}` is organized as: `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).
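A minimal sketch of producing that directory layout is shown below. The helper names are hypothetical, and using integer sample IDs starting from 0 is an assumption; the text only fixes the path shape and the `/` to `_` replacement.

```python
import os
import tempfile


def sample_path(sample_dir: str, task_id: str, sample_id: int) -> str:
    """Build ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py for one sample."""
    # Task IDs such as "HumanEval/0" contain "/", which is not usable in a
    # single directory name, hence the replacement with "_".
    return os.path.join(sample_dir, task_id.replace("/", "_"), f"{sample_id}.py")


def write_sample(sample_dir: str, task_id: str, sample_id: int, code: str) -> str:
    """Write one solution to its file, creating the task directory if needed."""
    path = sample_path(sample_dir, task_id, sample_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(code)
    return path


# Demo in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_sample(demo_dir, "HumanEval/0", 0, "def f():\n    return 1\n")
```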
We strongly recommend using a sandbox such as Docker:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl
...Or, if you want to try it locally regardless of the risks:
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
Warning
Do you use a very slow machine?
LLM solutions are regarded as failed on timeout (and OOM etc.).
Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where:
- $T_{base}$ is the minimal timeout (configurable by `--min-time-limit`; default to 0.2s);
- $T_{gt}$ is the runtime of the ground-truth solutions (achieved via profiling);
- $k$ is a configurable factor `--gt-time-limit-factor` (default to 4);
If your machine is too slow and you are getting high-variance results, try to use larger $k$ and $T_{base}$.
Additionally, avoid overloading your test bed while the evaluation is running.
For example, using `--parallel 64` on a 4-core machine or running other workloads during evaluation is a bad idea.
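To make the timeout rule concrete, here is a small sketch. It assumes the per-test limit is the larger of the minimal timeout and the scaled ground-truth runtime, which is how the definitions of $T_{base}$, $T_{gt}$ and $k$ read. The function name is hypothetical; the parameter names and defaults mirror the flags `--min-time-limit` and `--gt-time-limit-factor`.

```python
def time_limit(
    gt_runtime: float,
    min_time_limit: float = 0.2,
    gt_time_limit_factor: float = 4.0,
) -> float:
    """Per-test timeout in seconds: max(T_base, k * T_gt) (assumed form)."""
    return max(min_time_limit, gt_time_limit_factor * gt_runtime)


# A fast ground truth (10 ms) is dominated by the minimal timeout, while a
# slow one (0.5 s) is dominated by the scaled ground-truth runtime.
fast_limit = time_limit(0.01)
slow_limit = time_limit(0.5)
```

Under this reading, raising the factor mainly relaxes slow tasks and raising the minimal timeout mainly relaxes fast ones, so a slow machine may need both.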
Evaluate with local GitHub repo:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
More command-line flags:
- `--parallel`: by default half of the cores
- `--base-only` (store_true): only run base HumanEval tests
- `--i-just-wanna-run`: force a re-run
The output should look like the following (a GPT-4 greedy decoding example):
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`, where only k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be created. Remove it to re-run the evaluation
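For readers unfamiliar with `pass@k`, the sketch below uses the widely used unbiased estimator from the original HumanEval (Codex) paper, $1 - \binom{n-c}{k}/\binom{n}{k}$ for `n` samples of which `c` pass. The text here does not state which estimator is used, so treat this as an assumption; `pass_at_k` and `usable_ks` are hypothetical helper names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made only of failing samples;
    # it is 0 whenever fewer than k samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def usable_ks(n_samples: int, ks=(1, 10, 100)) -> list:
    """Keep only the k values that do not exceed the sample size."""
    return [k for k in ks if k <= n_samples]
```

With greedy decoding there is one sample per task, so only `pass@1` is reported, which matches the example output. The benchmark-level score is the mean of the per-task values.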
How long will it take?
If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds.
When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes using `--parallel 64` and `--test-details`.
Here are some tips to speed up the evaluation:
- Use `--parallel $(nproc)`
- Do NOT use `--test-details` if you just want to quickly get pass@k, as `--test-details` will run all tests (700+ on average for each task), while without `--test-details` the testing for a sample stops immediately when it fails the first test.
- Use our pre-evaluated results (see LLM-generated code)
- Use HumanEval+ Mini
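The cost difference between fail-stop and `--test-details` can be seen in a toy test loop. This is a conceptual sketch only, not the evaluator's implementation: `run_tests` and its `test_details` argument are hypothetical names that mimic the flag's described behavior.

```python
def run_tests(func, cases, test_details=False):
    """Run `func` on (args, expected) pairs and return per-test results.

    Without test_details the loop stops at the first failure (fail-stop);
    with it, every test is executed and recorded.
    """
    results = []
    for args, expected in cases:
        try:
            ok = func(*args) == expected
        except Exception:
            # Any crash counts as a failed test.
            ok = False
        results.append(ok)
        if not ok and not test_details:
            break
    return results


def buggy_identity(x):
    """Toy solution that is wrong for x >= 2."""
    return x if x < 2 else -1


cases = [((0,), 0), ((1,), 1), ((2,), 2), ((3,), 3)]
fail_stop = run_tests(buggy_identity, cases)
detailed = run_tests(buggy_identity, cases, test_details=True)
```

Both modes give the same pass/fail verdict for the sample. The detailed mode simply keeps paying for tests after the verdict is already known.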
Note
Try out HumanEvalPlus-Mini, which selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a `--mini` flag and it can run 23+% faster (even faster if you evaluate all tests without fail-stop with `--test-details`).
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
We also share pre-generated code samples from LLMs we have evaluated:
- HumanEval+: See the attachment of our v0.1.0 release.
- MBPP+: See the attachment of our v0.2.0 release.
Each sample file is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`.
You can unzip them to a folder named like `${model_name}_temp_${temperature}` and run the evaluation from scratch with:
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
To use these tools, please first install the repository from GitHub:
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt
Check LLM-produced code and answer the following questions:
- Is the generation entirely done for all samples / all problems in the dataset?
- Is the LLM-generated code compilable? (If not, something could be wrong and you should check.)
python tools/checker.py --samples samples.jsonl --dataset [humaneval|mbpp]
# --samples can also be a directory organized as: ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py
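The two questions above can be approximated with a few lines of standard-library Python. This sketch is not the logic of `tools/checker.py`; it assumes "compilable" means the code parses as Python, that each record carries a `solution` field, and the function names are hypothetical.

```python
import ast


def is_compilable(code: str) -> bool:
    """Return True if `code` parses as Python source."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes
        # on some Python versions.
        return False


def check_samples(samples, expected_task_ids):
    """Return (task IDs with no sample, task IDs with unparsable code)."""
    seen = {s["task_id"] for s in samples}
    missing = sorted(set(expected_task_ids) - seen)
    broken = sorted(
        {s["task_id"] for s in samples if not is_compilable(s.get("solution", ""))}
    )
    return missing, broken


demo_samples = [
    {"task_id": "Demo/0", "solution": "def f():\n    return 1\n"},
    {"task_id": "Demo/1", "solution": "Here's the code:\ndef g():\n    return 2\n"},
]
missing, broken = check_samples(demo_samples, ["Demo/0", "Demo/1", "Demo/2"])
```

Parsing only catches syntax-level problems; a sample that parses can still fail every test.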
LLM-generated code may contain syntax errors, but some of them are easy to fix with simple post-processing. This tool makes LLM-generated code cleaner and more likely to compile by applying post-processing such as trimming at additional "magical" EOF markers and removing garbage non-code tokens.
# If you are storing code in a jsonl file:
python tools/sanitize.py --samples samples.jsonl --dataset [humaneval|mbpp]
# Sanitized code will be produced to `samples-sanitized.jsonl`
# If you are storing code in directories:
python tools/sanitize.py --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
You should now further check the validity of the sanitized code with `tools/checker.py`.
Sometimes (e.g., with chat models) the output contains natural-language lines that break compilation.
You can use `--rm-prefix-lines` to cut those lines by their prefix (e.g., `--rm-prefix-lines "Here's"`).
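What such a prefix filter might do is sketched below. This is a guess at the behavior from the flag's description, not the code in `tools/sanitize.py`; `remove_prefix_lines` is a hypothetical helper, and matching on the exact start of each line is an assumption.

```python
def remove_prefix_lines(code: str, prefix: str) -> str:
    """Drop every line of `code` that starts with `prefix`."""
    kept = [line for line in code.splitlines() if not line.startswith(prefix)]
    return "\n".join(kept)


raw = "Here's the solution:\ndef f():\n    return 1"
cleaned = remove_prefix_lines(raw, "Here's")
```

Because the match is anchored at the start of the line, indented code that happens to contain the same words inside a string or comment is left untouched.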
python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
- `evalplus` is the package name.
- `${DATASET}_plus` is the name of the dataset applied with `evalplus`.
@article{evalplus,
title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
journal={arXiv preprint arXiv:2305.01210},
year={2023},
}