This is an evaluation set for the problem of directed (a.k.a. targeted) test input generation, designed in particular for evaluating Large Language Models (LLMs). The goal of directed test input generation is to automatically generate test inputs that reach a certain code location or produce a particular result.
For example:
```c
// EXAMPLE 1:
// generate an input `s` that reaches the target line:
int arr(char *s) {
    int symvar = s[0] - 48;
    int ary[] = {1, 2, 3, 4, 5};
    if (ary[symvar % 5] == 5) {
        // target line
    }
}
```
```c
// EXAMPLE 2:
// or produce the expected output:
int execute(char *s) {
    int ret = system(s);
    if (ret == 0) {
        return OK;  // expected
    }
    return ERR;
}
```
Targeted input generation plays a crucial role in various software engineering and security tasks, including fuzzing, bug reproduction (where the target is the bug location), and test suite augmentation (where the target is the specific code to be covered). It has also been extensively integrated with conventional testing tools to enhance coverage and overall performance.
Constraint-based techniques, such as symbolic execution and concolic testing, have been well explored for this problem, while Large Language Models (LLMs) have demonstrated strong performance in code understanding and reasoning. We use PathEval to benchmark and evaluate the ability of LLMs to solve the problem of directed test input generation.
LLMs and constraint-based tools have distinct advantages and disadvantages on this problem. For instance, constraint-based tools can easily generate a target input for EXAMPLE 1 above (e.g., "4", since '4' - 48 = 4 and ary[4] == 5), while this is challenging for LLMs. Interestingly, the opposite holds for EXAMPLE 2, where the return value of system() is hard to encode as constraints but easy for an LLM to reason about.
Install gcc, g++, python3, and OpenJDK, then clone the repository:
```bash
apt update
apt install -y -q build-essential gcc g++ python3 python3-pip openjdk-11-jdk-headless libssl-dev
python3 -m pip install tqdm
git clone https://github.com/CGCL-codes/PathEval.git
cd PathEval
```
Alternatively, clone the repository and build the provided Docker image:
```bash
git clone https://github.com/CGCL-codes/PathEval.git
cd PathEval
docker build -t patheval .
docker run -it --rm patheval /bin/bash
# cd /work in container
```
The evaluation compiles and executes untrusted, model-generated data and code (see `scripts/*_check.py`). We strongly encourage running this project in a sandbox (e.g., a Docker container).
Users can use this dataset through a few simple APIs.

Example:
```python
from patheval import set_dataset, read_problems, evaluate_one

# select the dataset that corresponds to the language
set_dataset("patheval_cpp")  # or "patheval_java", "patheval_py", "logic_bombs_c"

# load the problems
problems = read_problems()

# pass the completion from your LLM; the evaluation result is returned
# completion = query(...)
evaluate_one(problems[0], completion)['pass']  # True / False
```
Alternatively, you can validate completions offline with the script `patheval.py`:
```bash
python3 patheval.py --dataset patheval_java --input sample.jsonl
```
For the available dataset options, see the documentation of `set_dataset` below.

The format of `sample.jsonl` is as follows (the order of the rows does not matter):
```
{"index": 486, "completion": "\"Mary had a little lamb\", 4"}
{"index": 532, "completion": "\"Hello,Hello,world !\" "}
{"index": 572, "completion": "Arrays.asList(1.0, 2., 3.)"}
...
```
Each problem can have one or more rows of answers (i.e., multiple rows with the same index). If a problem has no corresponding answer, it is counted as failed by default.
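For reference, such a file can be produced with a few lines of Python. The following is only a sketch, assuming you have already collected the completions in a dictionary that maps problem index to completion string; the variable names are illustrative.
```python
import json

# Hypothetical mapping: problem index -> completion string produced by your LLM.
completions = {
    486: '"Mary had a little lamb", 4',
    532: '"Hello,Hello,world !" ',
}

# Write one JSON object per line in the {"index": ..., "completion": ...} format.
with open("sample.jsonl", "w") as f:
    for index, completion in completions.items():
        f.write(json.dumps({"index": index, "completion": completion}) + "\n")
```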
APIs

`set_dataset(dataset_name)`
- Purpose: Selects the dataset for evaluation.
- Parameters:
  - `dataset_name` (str): The name of the dataset to use. Valid values are:
    - `"patheval_cpp"`: C++ dataset
    - `"patheval_java"`: Java dataset
    - `"patheval_py"`: Python dataset
    - `"logic_bombs_c"`: logic bombs dataset (in C language)
- Returns: None

`read_problems()`
- Purpose: Loads the problems from the selected dataset.
- Parameters: None
- Returns:
  - `problems` (list): A list of problems.

`evaluate_one(problem, completion)`
- Purpose: Evaluates the given completion for a single problem.
- Parameters:
  - `problem` (dict): One problem from the list returned by `read_problems()`.
  - `completion` (str): The completion generated by the LLM.
- Returns: the input `problem` dictionary with an additional key:
  - `"pass"` (bool): indicates whether the completion passes the problem.
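Putting the three APIs together, a minimal end-to-end evaluation loop might look like the sketch below. Here `query_llm` is a hypothetical placeholder for your own model call; everything else follows the APIs documented above.
```python
from patheval import set_dataset, read_problems, evaluate_one

def query_llm(problem):
    # Hypothetical placeholder: replace this with a call to your own LLM.
    # It should return the string that replaces <FILL_ME> in the target.
    return ""

# select the dataset that corresponds to the language
set_dataset("patheval_cpp")
problems = read_problems()

passed = 0
for problem in problems:
    completion = query_llm(problem)
    result = evaluate_one(problem, completion)  # returns the problem dict plus a "pass" key
    passed += int(result["pass"])

print(f"pass rate: {passed}/{len(problems)}")
```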
For each sample, the following information is provided:
| Key | Description |
|---|---|
| index | index in this dataset |
| humaneval_task_id | task id in HumanEval |
| focal_method_name | name of the focal method |
| focal_method_para | parameters of the focal method |
| focal_method_return_type | return type of the focal method |
| focal_method | code of the focal method |
| target | code of the target |
In the logic bombs samples, `humaneval_task_id` is replaced by `logic_bombs_task_id`.
The samples are placed under the `data` folder as `.jsonl` files. The samples in the three different files are semantically equivalent but implemented in different programming languages.
An example C++ sample is shown as follows:
```json
{
    "index": 180,
    "humaneval_task_id": "CPP/53",
    "focal_method_name": "add",
    "focal_method_para": "(int x,int y)",
    "focal_method_return_type": "int",
    "focal_method": "#include<stdio.h>\n#include<stdlib.h>\nusing namespace std;\n#include<algorithm>\n#include<math.h>\nint add(int x,int y){\n return x+y;\n}",
    "target": "#undef NDEBUG\n#include<assert.h>\n\nint main(){\n\tauto result = add(<FILL_ME>);\n\tassert(result==5);\n}"
}
```
We use `<FILL_ME>` to mark the position for LLMs to fill in.
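For illustration, one straightforward way to turn a sample into an LLM prompt is to show the model the focal method together with the target harness and ask it for the text that replaces `<FILL_ME>`. The sketch below uses the fields documented above; the prompt wording itself is only an assumption, not the prompt used in the paper.
```python
def build_prompt(problem):
    # Combine the focal method and the target harness into one prompt and
    # ask the model to output only the replacement for <FILL_ME>.
    return (
        "Given the following function:\n"
        f"{problem['focal_method']}\n\n"
        "and the following test code:\n"
        f"{problem['target']}\n\n"
        "Provide only the text that should replace <FILL_ME> so that the test passes."
    )
```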
We are extending the dataset with samples from real-world open-source projects; a sample is shown under `data/rw`. These samples will be added to this repo soon.
The number of samples differs slightly across the three programming languages because the dataset was converted from HumanEval-X in an automated manner; a very small number of samples failed during the automated conversion and were therefore discarded.
This dataset was originally created for our paper "Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation" (ASE 2024). The preview version is available here.
```bibtex
@inproceedings{jiang2024towards,
  title={Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation},
  author={Jiang, Zongze and Wen, Ming and Cao, Jialun and Shi, Xuanhua and Jin, Hai},
  booktitle={39th {IEEE/ACM} International Conference on Automated Software Engineering,
             {ASE} 2024, California, United States, October 27 - November 1, 2024},
  year={2024}
}
```