# CodeGen4Libs

This repo contains the artifacts for the ASE 2023 paper "CodeGen4Libs: A Two-stage Approach for Library-oriented Code Generation".

## Updates

- 2023-09-10: Initial benchmark release
- 2023-10-04: Add Hugging Face support

## TODO

- Model Implementations

## Hugging Face Support

### Hugging Face Datasets

#### Usage

```python
from datasets import load_dataset

dataset = load_dataset("FudanSELab/CodeGen4Libs")
```
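If you only need one split, `load_dataset` also accepts a `split` argument (standard `datasets` API); for example:

```python
from datasets import load_dataset

# load only the test split instead of the full DatasetDict
test_set = load_dataset("FudanSELab/CodeGen4Libs", split="test")
print(len(test_set))  # 6002, per the structure shown below
```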

#### Dataset Structure

```
DatasetDict({
    train: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],
        num_rows: 391811
    })
    validation: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],
        num_rows: 5967
    })
    test: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],
        num_rows: 6002
    })
})
```
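Each split behaves like a regular `datasets.Dataset` and can be indexed directly; a minimal sketch using a few of the fields documented below:

```python
from datasets import load_dataset

dataset = load_dataset("FudanSELab/CodeGen4Libs")
example = dataset["train"][0]

# the natural language description and the ground-truth code
print(example["comment"])
print(example["clean_method"])
# the libraries the target code is expected to use
print(example["libraries"])
```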

## Benchmark Format

The benchmark is structured and saved in the `DatasetDict` format, accessible at Dataset and Models of CodeGen4Libs. The data fields of each tuple are as follows:

- `id`: the unique identifier of each tuple.
- `method`: the original method-level code of each tuple.
- `clean_method`: the ground-truth method-level code for each task.
- `doc`: the documentation of the method-level code.
- `comment`: the natural language description of each tuple.
- `method_name`: the name of the method.
- `extra`: extra information on the code repository to which the method-level code belongs.
  - `license`: the license of the code repository.
  - `path`: the path of the code repository.
  - `repo_name`: the name of the code repository.
  - `size`: the size of the code repository.
- `imports_info`: the import statements of each tuple.
- `libraries_info`: the library information of each tuple.
- `input_str`: the designed model input.
- `input_ids`: the ids of the tokenized input.
- `tokenized_input_str`: the tokenized input.
- `input_token_length`: the length of the tokenized input.
- `labels`: the ids of the tokenized output.
- `tokenized_labels_str`: the tokenized output.
- `labels_token_length`: the length of the tokenized output.
- `retrieved_imports_info`: the retrieved import statements of each tuple.
- `retrieved_code`: the retrieved method-level code of each tuple.
- `imports`: the imported packages of each import statement.
- `cluster_imports_info`: the clustered import information of the code.
- `libraries`: the libraries used by the code.
- `attention_mask`: the attention mask of the input.
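To inspect the pre-tokenized fields, the stored ids can be decoded back to text; a minimal sketch, assuming the records were tokenized with the `Salesforce/codet5-base` tokenizer used in the Usage section below:

```python
from datasets import load_dataset
from transformers import RobertaTokenizer

dataset = load_dataset("FudanSELab/CodeGen4Libs")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

example = dataset["train"][0]
# input_str is the raw model input; input_ids is its tokenization
print(example["input_str"])
print(tokenizer.decode(example["input_ids"], skip_special_tokens=True))

# labels holds the ids of the tokenized target; drop any padding ids (< 0)
label_ids = [t for t in example["labels"] if t >= 0]
print(tokenizer.decode(label_ids, skip_special_tokens=True))
```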

## Models Download

- NL+Libs+Imports(Ret)->Imports
- NL+Libs->Imports
- NL+Libs->Code
- NL+Libs+Imports(Gen)->Code
- NL+Libs+Code(Ret)->Code
- NL+Libs+Imports(Gen)+Code(Ret)->Code
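The two-stage pipeline chains two of these models: stage one generates import statements from the description and libraries (NL+Libs->Imports), and stage two generates the method conditioned on the description, libraries, and the generated imports (NL+Libs+Imports(Gen)->Code). A minimal sketch with hypothetical local checkpoint paths; substitute the directories downloaded from the links above, and note that the exact input concatenation should follow the dataset's `input_str` field:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

# hypothetical local paths to the downloaded checkpoints
imports_model = T5ForConditionalGeneration.from_pretrained("./nl-libs-to-imports")
code_model = T5ForConditionalGeneration.from_pretrained("./nl-libs-imports-gen-to-code")

def generate(model, text, max_length=512):
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_length=max_length)
    return tokenizer.decode(out[0], skip_special_tokens=True)

nl = "Gets the detailed information for a given agent pool"
libs = "azure"
# stage 1: NL + Libs -> Imports
imports = generate(imports_model, f"{nl} {libs}")
# stage 2: NL + Libs + Imports(Gen) -> Code
code = generate(code_model, f"{nl} {libs} {imports}")
print(code)
```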

## Usage

1. Environment Setup

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration
```

2. Load Model

```python
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
# add <code>, </code> as special tokens
tokenizer.add_special_tokens(
    {"additional_special_tokens": tokenizer.special_tokens_map["additional_special_tokens"] + ["<code>", "</code>"]}
)
# load the fine-tuned model; PathUtil and version come from this repo's
# utilities and resolve the local checkpoint directory
model_name = "codegen4lib_base"
model_dir = PathUtil.finetune_model(f"{version}/best_{model_name}")
model = T5ForConditionalGeneration.from_pretrained(model_dir).to("cuda")
```

3. Generate Example

```python
input_str = "Gets the detailed information for a given agent pool"
input_ids = tokenizer(input_str, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=512)
print("output_str: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

```bibtex
@inproceedings{ase2023codegen4libs,
  author       = {Mingwei Liu and Tianyong Yang and Yiling Lou and Xueying Du and Ying Wang and Xin Peng},
  title        = {{CodeGen4Libs}: A Two-stage Approach for Library-oriented Code Generation},
  booktitle    = {38th {IEEE/ACM} International Conference on Automated Software Engineering,
                  {ASE} 2023, Kirchberg, Luxembourg, September 11-15, 2023},
  pages        = {0--0},
  publisher    = {{IEEE}},
  year         = {2023},
}
```