🔥 Code and data for our paper "Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark"

Online_Appedix.pdf provides additional experimental results on LLMs' performance across the subdomains of DomainCodeBench.

## 💡 Overview

In this paper, we propose DomainCodeBench, a new multi-domain, multi-language code generation benchmark.

We find that previous code generation benchmarks focus on general-purpose programming tasks, leaving LLMs' domain-specific programming capabilities largely underexplored. To fill this gap, we construct DomainCodeBench through the pipeline below to evaluate the code generation capabilities of today's mainstream LLMs across popular software application domains.

Here is the construction pipeline of DomainCodeBench:

*(Figure: construction pipeline of DomainCodeBench)*

Here is an example of a task instance in DomainCodeBench:

*(Figure: an example task instance in DomainCodeBench)*

## ⚡️ Quick Start

  1. Clone this repository.
  2. Run `conda env create -f environment.yml` to create a conda environment named `DomainCodeBench`.

## 🚀 Evaluation

  • evaluation
    • CodeBleu: contains the entire set of evaluation tools of DomainCodeBench (a sketch of how the component scores are combined follows this list)
      • keywords: per-language keyword lists; CodeBLEU uses them to give higher weight to syntactically and semantically important tokens when matching candidate code against references.
      • parser
        • build
        • DFG: contains data flow graph extraction tools for all the covered languages
        • vendor
        • utils.py
      • bleu.py
      • calc_code_bleu.py
      • dataflow_match.py
      • syntax_match.py
      • utils.py
      • weighted_ngram_match.py
      • run_script.sh
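
For orientation, here is a minimal sketch of the standard CodeBLEU formulation that `calc_code_bleu.py` follows: a weighted sum of the n-gram match, keyword-weighted n-gram match, syntax (AST) match, and data-flow match scores. The function name and the default 0.25 weights below are illustrative assumptions; consult the script for the exact weights used in DomainCodeBench.

```python
# Illustrative sketch of the standard CodeBLEU combination (not the repo's exact code).
# Each component score is assumed to lie in [0, 1].
def codebleu_score(ngram_match: float,
                   weighted_ngram_match: float,
                   syntax_match: float,
                   dataflow_match: float,
                   alpha: float = 0.25, beta: float = 0.25,
                   gamma: float = 0.25, delta: float = 0.25) -> float:
    """Weighted sum of the four CodeBLEU components."""
    return (alpha * ngram_match
            + beta * weighted_ngram_match
            + gamma * syntax_match
            + delta * dataflow_match)

# Example: a candidate that matches well lexically but poorly on data flow.
print(codebleu_score(0.80, 0.75, 0.70, 0.40))  # -> 0.6625
```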

To run evaluation on DomainCodeBench, put your generated results in the `generation_result` folder, following the format `generation_result/docstring_only/{model_name}`, and run the following command from the `DomainCodeBench/evaluation/CodeBleu` directory:

```
python calc_code_bleu.py --model {model_name} --predict_result_base_path generation_result/docstring_only
```

Alternatively, you can modify the source code in `DomainCodeBench/evaluation/CodeBleu/calc_code_bleu.py` to point directly at your own generated results.

## ⚖️ Benchmark DomainCodeBench

DomainCodeBench covers a total of 12 domains. For detailed information about DomainCodeBench, please refer to the DomainCodeBench folder; a minimal example of loading one domain file follows the list below.

  • DomainCodeBench
    • ./DomainCodeBench/Cloud_service.json
    • ./DomainCodeBench/Block_chain.json
    • ./DomainCodeBench/Desktop_application.json
    • ./DomainCodeBench/Distributed_system.json
    • ./DomainCodeBench/Game.json
    • ./DomainCodeBench/Mobile.json
    • ./DomainCodeBench/Web.json
    • ./DomainCodeBench/Robot.json
    • ./DomainCodeBench/Enterprise_application.json
    • ./DomainCodeBench/Data_analysis.json
    • ./DomainCodeBench/Deep_learning.json
    • ./DomainCodeBench/IoT.json
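
As a quick sanity check, the hedged snippet below loads one of the domain files and reports how many task instances it contains. The exact schema of each instance is not documented here, so the snippet only inspects the JSON container rather than assuming specific field names.

```python
# Minimal sketch for inspecting a DomainCodeBench domain file.
# No field names inside a task instance are assumed.
import json

with open("DomainCodeBench/Web.json", encoding="utf-8") as f:
    data = json.load(f)

# The file may be a list of task instances or a dict keyed by task id;
# handle both without committing to a specific schema.
instances = data if isinstance(data, list) else list(data.values())
print(f"Loaded {len(instances)} task instances from Web.json")
if instances and isinstance(instances[0], dict):
    print("Fields of the first instance:", sorted(instances[0].keys()))
```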