
Initial release of BigCode Evaluation Harness

@loubnabnl released this 25 May 09:37 · 494 commits to main since this release · 04b3493

Release notes

These are the release notes for the initial release (v0.1.0) of the BigCode Evaluation Harness.

Goals

The framework aims to achieve the following goals:

  • Reproducibility: Making it easy to report and reproduce results.
  • Ease-of-use: Providing access to a diverse range of code benchmarks through a unified interface.
  • Efficiency: Leveraging data parallelism on multiple GPUs to generate benchmark solutions quickly.
  • Isolation: Using Docker containers for executing the generated solutions.

Release overview

The framework supports the following features & tasks:

  • Features:

    • Any autoregressive model available on the Hugging Face Hub can be used, but we recommend code generation models trained specifically on code.
    • We provide multi-GPU text generation with accelerate for multi-sample problems, and Dockerfiles for running the generated solutions inside Docker containers for security and reproducibility (see the generation sketch after the task list below).
  • Tasks:

    • 4 Python code generation tasks (with unit tests): HumanEval, APPS, MBPP and DS-1000, in both completion (left-to-right) and insertion (FIM) modes (see the pass@k sketch below).
    • MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
    • PAL (Program-aided Language Models) evaluation on grade-school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains that mix text and code.
    • Code-to-text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP.
    • Documentation translation task from CodeXGLUE.
    • CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
    • Concode for Java code generation (2-shot setting and evaluation with BLEU score).
    • 3 multilingual downstream classification tasks: Java complexity prediction, Java code equivalence prediction, and C code defect prediction.
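
As a rough illustration of the multi-GPU generation path, here is a minimal sketch (not the harness's own main.py): it shards prompts across processes with accelerate and generates completions with a Hugging Face causal LM. The model name, prompts, and generation settings are illustrative placeholders.

```python
# Minimal sketch, assuming `accelerate` and `transformers` are installed.
# The model name, prompts, and generation settings are placeholders.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()
model_name = "gpt2"  # placeholder; use a code generation model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = accelerator.prepare(AutoModelForCausalLM.from_pretrained(model_name))

prompts = ["def fibonacci(n):", "def is_palindrome(s):"]  # toy problems
# Data parallelism: each process takes its own slice of the prompts.
local_prompts = prompts[accelerator.process_index :: accelerator.num_processes]

for prompt in local_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
    output = accelerator.unwrap_model(model).generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Launched with `accelerate launch <script>.py`, accelerate starts one process per GPU, and each process generates completions for its share of the problems.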

More details about each task can be found in the documentation in docs/README.md.
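
For the unit-test tasks above, the usual metric is pass@k. The snippet below is a minimal sketch of the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), included for illustration; it is not necessarily the exact code used in the harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of generated samples per problem
    c: number of samples that pass all unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 13 of which pass the tests
print(pass_at_k(200, 13, 1))   # ~0.065 -> pass@1
print(pass_at_k(200, 13, 10))  # pass@10
```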

Main Contributors

Full Changelog: https://github.com/bigcode-project/bigcode-evaluation-harness/commits/v0.1.0