Release notes
These are the release notes for the initial release of the BigCode Evaluation Harness.
Goals
The framework aims to achieve the following goals:
- Reproducibility: Making it easy to report and reproduce results.
- Ease-of-use: Providing access to a diverse range of code benchmarks through a unified interface.
- Efficiency: Leveraging data parallelism on multiple GPUs to generate benchmark solutions quickly.
- Isolation: Using Docker containers for executing the generated solutions.
Release overview
The framework supports the following features & tasks:
Features:
- Any autoregressive model available on the Hugging Face Hub can be used, but we recommend using code generation models trained specifically on code.
- Multi-GPU text generation with `accelerate` for multi-sample problems, and Dockerfiles for running the evaluation in Docker containers for security and reproducibility (a rough sketch of the generation setup is shown below).
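As a rough illustration of what data-parallel generation with `accelerate` can look like, here is a minimal sketch. The checkpoint name (gpt2), the prompts, the script name, and the sharding logic are placeholders for this example and are not the harness's actual code.

```python
# Minimal sketch of data-parallel generation with accelerate (illustrative only,
# not the harness's implementation). Launch with: accelerate launch generate.py
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()

# Any autoregressive checkpoint from the Hugging Face Hub; "gpt2" is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = accelerator.prepare(model)

# Placeholder prompts; in the harness these come from the benchmark tasks.
prompts = ["def fibonacci(n):", "def is_prime(n):"]

# Each process generates completions for its own shard of the prompts.
shard = prompts[accelerator.process_index::accelerator.num_processes]
for prompt in shard:
    inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
    output = accelerator.unwrap_model(model).generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

When launched with `accelerate launch`, each process handles a different shard of the prompts, which is what makes multi-sample generation fast on multiple GPUs.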
Tasks:
- 4 Python code generation tasks evaluated with unit tests (see the execution sketch after this list): HumanEval, APPS, MBPP and DS-1000, in both completion (left-to-right) and insertion (FIM) modes.
- MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
- PAL (Program-aided Language Models) evaluation for grade-school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains that interleave text and code.
- Code-to-text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP, as well as the documentation translation task from CodeXGLUE.
- CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
- Concode for Java code generation (2-shot setting and evaluation with BLEU score).
- 3 multilingual downstream classification tasks: Java complexity prediction, Java code equivalence prediction, and C code defect prediction.
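For the execution-based tasks above, a generated solution counts as correct only if it passes the task's unit tests. As a hedged illustration (not the harness's actual evaluation code), a single pass/fail check could look like the sketch below; in the harness, such execution is intended to run inside the provided Docker containers.

```python
# Rough illustration (not the harness's code) of checking one generated solution
# against a unit test by executing it in a separate process with a timeout.
import subprocess
import sys
import tempfile
import textwrap

candidate = textwrap.dedent("""\
    def add(a, b):
        return a + b
""")
unit_test = "assert add(2, 3) == 5\n"

# Write the candidate program plus its test to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(candidate + unit_test)
    path = f.name

# A non-zero exit code or a timeout counts as a failure.
try:
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    passed = result.returncode == 0
except subprocess.TimeoutExpired:
    passed = False

print("passed" if passed else "failed")
```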
More details about each task can be found in the documentation in docs/README.md.
Main Contributors
Full Changelog: https://github.com/bigcode-project/bigcode-evaluation-harness/commits/v0.1.0