This repository contains the code and datasets used for our ANLP term project, which is based on the paper "NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness". The original code source is here.
- Original Datasets: The original NoFunEval dataset.
- Swapped Datasets: Modified datasets in which the source code in the input is replaced by the target code.
- Evaluation scripts: Scripts that evaluate LMs on examples from NoFunEval and produce Score@K values for the metrics reported in the paper: DiffBleu, Average SpeedUp, CodeQL, and CodeQL-DiffBleu.
- Runtime Efficiency: Improving algorithmic time complexity of code for faster run-time.
- Maintainability: Improving code readability and style, and following the best programming practices.
- Latency: Optimizing code for faster response times in Android applications.
- Resource Utilization: Optimizing code for lower resource utilization on edge devices.
- Security: Fixing specific security vulnerabilities in the input code.
- HumanEvalClassify: Distinguishing between buggy and correct snippets of code.
Create a conda environment and activate it (the Python version must be 3.8; we encountered installation errors with more recent versions).
conda create -n nofuneval python=3.8
conda activate nofuneval
Install all the requirements (except vllm):
bash setup.sh
Install vllm at the end:
pip install vllm
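As an optional sanity check (not part of the original setup instructions), you can confirm that the environment resolves correctly after installation:

```bash
# Optional sanity check: both imports should succeed inside the nofuneval environment
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"
```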
For generation on the original datasets, we use the command below for all models from HuggingFace:
python3 src/nofunedit_generation.py --data_subset <subset from nofunedit: eg-latency> --model_path <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt> --num_samples <number of samples to be generated: eg-1> --precision <floating point format: eg-fp16> --batch_size <number of examples to send to llm engine at once: eg-1>
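For example, a fully specified invocation using the illustrative values from the placeholders above (greedy decoding on the latency subset) would look like this:

```bash
# Illustrative example: greedy generation on the latency subset with WizardCoder-15B
python3 src/nofunedit_generation.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt \
  --num_samples 1 \
  --precision fp16 \
  --batch_size 1
```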
For GPT-4o, we use the command below:
python3 src/gpt4_nofun_edit.py --data_subset <subset from nofunedit: eg-latency> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt>
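For instance, with the same illustrative values as above:

```bash
# Illustrative example: GPT-4o greedy generation on the latency subset
python3 src/gpt4_nofun_edit.py \
  --data_subset latency \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt
```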
For generation on the swapped datasets, we use the command below for all models from HuggingFace:
python3 src/nofunedit_generation_swapped.py --data_subset <subset from nofunedit: eg-latency> --model_path <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt> --num_samples <number of samples to be generated: eg-1> --precision <floating point format: eg-fp16> --batch_size <number of examples to send to llm engine at once: eg-1>
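For instance, the swapped-dataset counterpart of the earlier example would be:

```bash
# Illustrative example: greedy generation on the swapped latency subset
python3 src/nofunedit_generation_swapped.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt \
  --num_samples 1 \
  --precision fp16 \
  --batch_size 1
```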
For GPT-4o on the swapped datasets, we use the command below:
python3 src/gpt4_nofun_edit_swapped.py --data_subset <subset from nofunedit: eg-latency> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt>
For open-source models on HuggingFace, run:
python3 src/classification_generation.py --data_subset <subset from non_func or humanevalclassify: eg-latency> --model <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt> --precision <floating point format: eg-fp16> --batch_size <number of examples to send to llm engine at once: eg-1>
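For example, a classification run on HumanEvalClassify with the settings we used (greedy decoding, 4-token generation budget) might look like:

```bash
# Illustrative example: HumanEvalClassify classification with WizardCoder-15B
python3 src/classification_generation.py \
  --data_subset humanevalclassify \
  --model WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 4 \
  --prompt base_prompt \
  --precision fp16 \
  --batch_size 1
```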
For GPT-4o, run:
python3 src/gpt4_nofun_classify.py --data_subset <subset from non_func or humanevalclassify: eg-latency> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-4> --prompt <type of prompt to use from our dataset: eg-base_prompt>
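The corresponding GPT-4o run (again with illustrative values) would be:

```bash
# Illustrative example: GPT-4o classification on HumanEvalClassify
python3 src/gpt4_nofun_classify.py \
  --data_subset humanevalclassify \
  --temperature 0 \
  --max_new_tokens 4 \
  --prompt base_prompt
```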
To compute the evaluation metrics for the generated edits, run:
python3 src/evaluation.py --data_subset <subset from nofunedit: eg-latency> --model_path <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --prompt <type of prompt to use from our dataset: eg-base_prompt> --num_samples <number of samples to be generated: eg-1> --score_k <K values for score@k: eg-1,5,10,20> --metric <eval_metric to be used: eg-diffbleu>
An example invocation is also provided as a script:
bash evaluation_example_script.sh
For the maintainability and security subsets, first run src/evaluation.py with the diffbleu and codeql metrics; then run src/evaluation.py again with the codeql-diffbleu metric, which must be run after the first two.
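As a concrete sketch (the model name and settings are illustrative), the three-step flow for the security subset would be:

```bash
# Illustrative three-step evaluation for the security subset:
# diffbleu and codeql first, then codeql-diffbleu, which relies on the first two
python3 src/evaluation.py --data_subset security --model_path WizardLM/WizardCoder-15B-V1.0 \
  --prompt base_prompt --num_samples 1 --score_k 1 --metric diffbleu
python3 src/evaluation.py --data_subset security --model_path WizardLM/WizardCoder-15B-V1.0 \
  --prompt base_prompt --num_samples 1 --score_k 1 --metric codeql
python3 src/evaluation.py --data_subset security --model_path WizardLM/WizardCoder-15B-V1.0 \
  --prompt base_prompt --num_samples 1 --score_k 1 --metric codeql-diffbleu
```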
| Parameter | Description |
|---|---|
| `data_subset` | The subset of data to use. Options: `latency`, `resource_util`, `maintainability`, `security`, `runtime_efficiency` for nofunedit; additionally `humanevalclassify` for classification. |
| `model_path` | The path of the model from HF. Example: `WizardLM/WizardCoder-15B-V1.0`. |
| `prompt` | Prompt to use. Options: `base_prompt`, `one_shot`, `chain_of_thought`, `coding_concepts`. |
| `num_samples` | Number of samples to generate. Example: `1` (we used 1 for greedy decoding and 20 for higher-temperature sampling). [NoFunEdit generation only] |
| `max_new_tokens` | Budget for new-token generation. Example: `1200` (NoFunEdit: we used 1200 for `runtime_efficiency` and `security` for all prompts except CoT, where 1500 was used; for the other subsets we used 5192 or the maximum possible limit. Classification: we used 4 for all generations.) |
| `temperature` | Temperature for model generation. Example: `0` (we used 0 for greedy decoding and 0.8 when drawing 20 samples). |
| `score_k` | K values for Score@K, comma-separated. Example: `1,5,10,20` (should not exceed `num_samples`; the authors always report scores for K=1). [Eval only] |
| `metric` | Metric to use for evaluation. Options: `diffbleu`, `codeql`, `codeql-diffbleu` (to be run after the first two), `classification`, `runtime`. [Eval only] |
| Parameter | Description |
|---|---|
| `batch_size` | Batch size. Default: `1` |
| `precision` | Floating point format. Default: `fp16` |
| `tensor_parallel_size` | Number of GPUs to use for tensor parallelism. Default: `1` |
| `swap_space` | The size (GiB) of CPU memory per GPU to use as swap space. Default: `12` |
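If you need to override these engine settings, they can be passed alongside the generation arguments. For example (assuming the generation script forwards these flags to the vLLM engine, as the table above suggests):

```bash
# Illustrative: generation command with the vLLM engine parameters set explicitly
python3 src/nofunedit_generation.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt \
  --num_samples 1 \
  --precision fp16 \
  --batch_size 1 \
  --tensor_parallel_size 1 \
  --swap_space 12
```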
The `src/evaluate_classification_as_edit.py` script handles prediction and evaluation of our proposed approach.
| Argument | Type | Default | Description |
|---|---|---|---|
| `--data_subset` | str | `"resource_util"` | Dataset subset to use: `latency`/`resource_util`/`runtime_efficiency`/`maintenance`/`security` |
| `--model` | str | `"CodeLlama-7b-Instruct-hf"` | Name of the model to use |
| `--prompt` | str | `"coding_concepts"` | Prompt type: `base_prompt`/`coding_concepts`/`chain_of_thought`/`one_shot` |
| `--num_samples` | int | `1` | Number of samples generated |
| `--metric` | str | `"code_bert_score"` | Evaluation metric: `git_diff`/`edit_distance`/`bleu`/`codebleu`/`bert_score`/`code_bert_score` |
You will also need to change `generations_path_original`, `generations_path_swapped`, and `results_path` appropriately.
Run the script as follows:
python src/evaluate_classification_as_edit.py {ARGS}
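For example, a run using the defaults listed in the argument table above would be:

```bash
# Illustrative run with the documented default arguments
python src/evaluate_classification_as_edit.py \
  --data_subset resource_util \
  --model CodeLlama-7b-Instruct-hf \
  --prompt coding_concepts \
  --num_samples 1 \
  --metric code_bert_score
```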
The `error_analysis` directory contains the full text of the examples mentioned in Appendix A of our report.
All our reproducibility experiments were run on an AWS g6e.xlarge instance, which has a single NVIDIA L40S GPU with ~46 GB of memory. For GPT-4o, we query a LiteLLM proxy using the API credits provided for this semester.
All model outputs from the original datasets are provided in `generations_original`, and all model outputs from the swapped datasets are provided in `generations_swapped`.
The evaluation code for runtime efficiency has been derived from the PIE codebase.