Skip to content

Commit

Permalink
Upload data package from FigShare, https://doi.org/10.6084/m9.figshar…
Browse files Browse the repository at this point in the history
  • Loading branch information
bstee615 committed Jan 18, 2024
0 parents commit 4e48fd7
Show file tree
Hide file tree
Showing 8,540 changed files with 7,825,449 additions and 0 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
3 changes: 3 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
*.csv filter=lfs diff=lfs merge=lfs -text
*.log filter=lfs diff=lfs merge=lfs -text
*.out filter=lfs diff=lfs merge=lfs -text
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
*.csv
*.log
*.out
Binary file added CWE Groupings.xlsx
Binary file not shown.
Binary file added Empirical results.xlsx
Binary file not shown.
Binary file added Model+ExplainabilityModel Survey.xlsx
Binary file not shown.
162 changes: 162 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
# Data package for "An Empirical Study of Deep Learning Models for Vulnerability Detection"

## Organization

Below is an annotated map of the package's organization.

```
.
├── cross_sections......................... cross-sections of the Devign and MSR datasets used for RQ2, RQ4, and RQ5.
│   ├── code............................... scripts used to generate the data for the cross-sections.
│   ├── logs............................... logs from running the scripts.
│   ├── bug_type........................... dataset examples used for RQ2.
│   ├── cross-project...................... dataset examples used for RQ5.
│   ├── dataset_size....................... dataset examples used for RQ4.
│   ├── project_diversity.................. dataset examples used for RQ5.
│   └── README.md.......................... detailed explanation of the files in each cross-section.
├── datasets............................... the original datasets. The dataset files are not included for brevity, but a download link is provided.
│   ├── Devign............................. the Devign dataset link and CodeXGLUE splits.
│   └── MSR................................ the MSR dataset link (as splits are not provided by Fan et al, we used the splits from LineVul).
│  
├── interpretability....................... interpretability analysis presented in RQ6.
│   ├── analysis........................... used to analyze the results of the importance scores.
│   ├── data_collection.................... used to collect importance scores from the models.
│   └── examples........................... Devign test dataset examples which we annotated with importance scores and analyzed manually.
│   └── RQ6................................ full listings for the examples shown in RQ6.
├── logistic_regression+stability.......... scripts used to perform logistic regression in RQ3 and stability analysis in RQ1.
│   ├── easyhard_lr.py..................... script for logistic regression.
│   └── stability.py....................... script for stability analysis.
│   └── eval_exports....................... data input into logistic_regression and stability.
├── models................................. the code used to run the deep learning models.
│   └── <model>............................ the directory for each model.
│      ├── code........................... scripts used to generate the plots.
│      ├── logs........................... logs from running training on the cross-section and reproduction datasets.
│      └── scripts........................ scripts used as entry points into the code.
├── plots.................................. used to generate the plots shown in the paper.
│   ├── code............................... scripts used to generate the plots.
│   ├── data............................... data input into the scripts.
│   └── plots.............................. the resulting plots.
├── CWE Groupings.xlsx..................... the full list of CWE groups in Table V.
├── Model+ExplainabilityModel Survey.xlsx.. results of the model survey and explainablity tool surveys.
└── Empirical results.xlsx................. the main results of the model reproduction and RQs 2,3,4,5,6.
└── README.md
```

## Model overview

The code, scripts, and logs for the following models are included.

Full experiment data:
- Devign
- ReVeal
- ReGVD
- CodeBERT
- VulBERTa-CNN
- VulBERTa-MLP
- PLBART
- LineVul
- Code2Vec

Model reproduction data:
- SySeVR
- VulDeeLocator

Each model directory includes `code`, `scripts`, and `logs`.
- `code` contains the code distributed by the implementers of the model, plus our
improvements to work on the MSR dataset and select cross-sections of the data.
- `scripts` contains the scripts we used to train/evaluate the models.
- `logs` contains the logged output from our reproduction and experiments.

The following models share parts of their implementation, and so are grouped in the same folder.
`VulBERTa-CNN` and `VulBERTa-MLP` are included in the directory named `VulBERTa-CNN_VulBERTa-MLP`.
`Devign` and `ReVeal` are included in the directory named `Devign_ReVeal`.

We reproduced the `CodeBERT` and the `VulBERTa` models on two different platforms during our experiments,
so these include two versions of the scripts which are mostly similar, except for minor implementation details.

We could not reproduce `VulDeeLocator` and `SySeVR`. VulDeeLocator requires the code to be compiled. Devign/MSR datasets contain many different real-world open-source projects; compiling all projects will not be practical. SySeVR needs to compute and label program slices. Labeling program slices correctly for large datasets like MSR and Devign is a real challenge. Their paper ([link](https://ieeexplore.ieee.org/document/9321538)) mentioned that the authors manually checked 2,605 examples and corrected 1,641 of the labels.

All models take as input the data in `datasets`; however, the file paths are different from the platforms we used,
so the training/evaluation will not be directly able to run.

Some models take as input a `jsonl` format, which can be generated with the script: `cross_sections/code/data_creator.ipynb`.

## Code overview

These programs take data as input, but the file paths and `PYTHONPATH` are different from our platform, so they may not be directly able to run.
We attempted to include basic data in the same location so that the input and outputs are observable.

## Dataset overview

Dataset sources:
- Devign: https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF/view?usp=sharing
- We used the canonical splits provided by the CodeXGLUE project (https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection).
- MSR: https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing
- As Fan et al. does not provide canonical dataset splits, we used the same splits generated by the LineVul authors.

## References

```
Devign dataset:
@article{zhou_devign_2019,
title = {Devign: {Effective} vulnerability identification by learning comprehensive program semantics via graph neural networks},
volume = {32},
journal = {Advances in Neural Information Processing Systems},
author = {Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang},
year = {2019},
pages = {1--11},
annote = {Graph NNFunction level},
}
MSR dataset:
@article{fan_cc_2020,
title = {A {C}/{C}++ {Code} {Vulnerability} {Dataset} with {Code} {Changes} and {CVE} {Summaries}},
issn = {9781450379571},
doi = {10.1145/3379597.3387501},
journal = {Proceedings - 2020 IEEE/ACM 17th International Conference on Mining Software Repositories, MSR 2020},
author = {Fan, Jiahao and Li, Yi and Wang, Shaohua and Nguyen, Tien N.},
year = {2020},
keywords = {C/C++ Code, Code Changes, Common Vulnerabilities and Exposures},
pages = {508--512},
}
CodeXGLUE:
@article{lu_codexglue_2021,
title = {{CodeXGLUE}: {A} {Machine} {Learning} {Benchmark} {Dataset} for {Code} {Understanding} and {Generation}},
url = {http://arxiv.org/abs/2102.04664},
year = {2021},
keywords = {machine learning, naturalness of software, program understanding},
}
tree-sitter:
@misc{https://doi.org/10.5281/zenodo.6326492,
doi = {10.5281/ZENODO.6326492},
url = {https://zenodo.org/record/6326492},
author = {Brunsfeld, Max and Thomson, Patrick and Hlynskyi, Andrew and Vera, Josh and Turnbull, Phil and Clem, Timothy and Creager, Douglas and Helwer, Andrew and Rix, Rob and Van Antwerpen, Hendrik and Davis, Michael and {, Ika} and {Tuấn-Anh Nguyễn} and Brunk, Stafford and {Niranjan Hasabnis} and {Bfredl} and {Mingkai Dong} and Panteleev, Vladimir and {Ikrima} and Kalt, Steven and Lampe, Kolja and Pinkus, Alex and Schmitz, Mark and Krupcale, Matthew and {Narpfel} and Gallegos, Santos and Martí, Vicent and {, Edgar} and Fraser, George},
title = {tree-sitter/tree-sitter: v0.20.6},
publisher = {Zenodo},
year = {2022},
copyright = {Open Access}
}
scikit-learn:
@article{10.5555/1953048.2078195,
author = {Pedregosa, Fabian and Varoquaux, Ga\"{e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, \'{E}douard},
title = {Scikit-Learn: Machine Learning in Python},
year = {2011},
issue_date = {2/1/2011},
publisher = {JMLR.org},
volume = {12},
number = {null},
issn = {1532-4435},
journal = {J. Mach. Learn. Res.},
month = {nov},
pages = {2825–2830},
numpages = {6}
}
```
94 changes: 94 additions & 0 deletions cross_sections/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# General guidelines

`example_index` sequentially labels the canonical versions of the two datasets. The indices correspond to rows in these files.
- MSR: `MSR_data_cleaned.csv` https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing
- Devign: `function.json` https://drive.google.com/uc?id=1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF
For example, the 1st item in `function.json` will have `example_index == 0`, 50th item has `example_index == 49`, ...
For example, the 1st row in `MSR_data_cleaned.csv` will have `example_index == 0`, 50th row has `example_index == 49`, ...

For combined dataset, both `dataset` and `example_index` are needed to uniquely identify a dataset example.
**DEPRECATED:** _For MSR, include both `before` and `after` versions of code for each `example_index` in the same dataset partition._
We are now excluding all `after` examples from MSR.

# dataset_size

Meaning of filenames:
- `imbalanced_XXX.csv`: proportion XXX sampled from the imbalanced version of combined dataset (Devign + MSR). `imbalanced_1.0.csv` contains the entire combined dataset.
- `balanced_XXX.csv`: balanced version of `imbalanced_XXX.csv` (excludes rows from MSR where `vul == 0`).

Label proportions:
```
imbalanced balanced
target
0.0 0.897009 0.524416
1.0 0.102991 0.475584
```

Meaning of columns:
- dataset: MSR or Devign, which dataset this example came from
- example_index: numerical index for the example which is unique _within a dataset_ (but not unique in combined datasets)
- project: metadata, name of the project this code came from
- commit_id: metadata, name of the commit id this code came from
- split: dataset split to which this example is allocated

# cross-project and project_diversity

Note: `fold_XXX_holdout` is identical for the same fold index in both cross-project and project-diversity settings. They are provided separately in both settings for convenience.

Label proportions of first fold:
```
cross_project_dataset
total: 199536
target
0 0.945373
1 0.054627
diverse_dataset
total: 92582
target
0 0.941911
1 0.058089
nondiverse_dataset
total: 91590
target
0 0.952113
1 0.047887
```

## cross-project

Meaning of filenames:
- `fold_XXX` denotes that the file belongs to fold number `XXX`.
- `fold_XXX_holdout.csv` is the holdout set, which is the set of ~10k examples whose projects are distinct from the projects in `fold_XXX_dataset.csv`.
- `fold_XXX_dataset.csv` is the mixed-project dataset, which is split into train/validation/test without separating projects.

Steps for the experiment.
For each fold `fold_XXX`:
1. Train on the `train` split in `fold_XXX_dataset.csv` and use `valid` split for validation.
2. Evaluate on both the `test` split of `fold_XXX_dataset.csv` and all the examples in `fold_XXX_holdout.csv`.

## project_diversity

Meaning of filenames:
- `nondiverse.csv` is the nondiverse set containing only examples from `Chrome`, which is split into training/validation splits. This set is the same for all folds, so it is not identified by `fold_XXX`.
- `fold_XXX` denotes that the file belongs to fold number `XXX`.
- `fold_XXX_diverse.csv` is the diverse set which excludes `Chrome` and the projects in `holdout`, and is split into training/validation splits.
- `fold_XXX_holdout.csv` is the holdout set, which is the set of ~10k examples whose projects are distinct from the projects in the diverse and nondiverse sets.

Steps for the experiment:
1. Train/validate the model on `train`/`valid` splits respectively in `nondiverse.csv`.
2. Train/validate for each `fold_XXX_diverse.csv`.
3. Evaluate both nondiverse and diverse models on `fold_XXX_holdout.csv` for each fold. Only evaluate the diverse model on its respective holdout set.

# bug_type

Each file contains the examples from a set of CWEs which have similar semantics.
All files are mutually exclusive.
All files contain train/valid/test splits.

Steps for the experiment.
For each file `bugtype_XXX`:
1. Train on the `train` split in `bugtype_XXX.csv` and use `valid` split for validation.
2. Evaluate on the `test` split of `bugtype_XXX.csv`.
2. Evaluate on the `test` split of each other `bugtype_YYY.csv`.
Loading

0 comments on commit 4e48fd7

Please sign in to comment.