Skip to content

Latest commit

 

History

History
30 lines (14 loc) · 2.16 KB

README.md

File metadata and controls

30 lines (14 loc) · 2.16 KB

This is our website for Identifying Vulnerable Third-Party Java Libraries from Textual Descriptions of Vulnerabilities and Libraries.

In folder tf-idf, we include the source code of our TF-IDF matcher. It is invoked by initializing a search_engine with a Maven corpus and directly invoking the search_topk_objects API.

In folder trainer_reranking, we include the source code of our BERT-FNN model, this script can be directly executed by setting the input/outut paths.

In NER_case.md, we show an motivating example case of our NER model, illustrating its effectiveness of identifying library-name entities.

We have open-sourced our VulLib and VeraJava dataset in the folder VulLib and VeraJava respectively.

In VulLib, VulLib/train.json, VulLib/valid.json, VulLib/test.json corresponds to the training, validate, and testing set of VulLib (old version); in our revision, we extend our dataset upto 2024.04.28 in VulLib/vullib_new.json, which includes 4,557 vulnerabilities and 7,280 <Vulnerability, Library> pairs.

Additionally, in RQ2, we split the testing set of into the zero-shot testing and the full-shot testing set, we also include them in VulLib/test_zero.json and VulLib/test_full.json.

In each file, an vulnerability entry includes its CVE id, vulnerability description, labels (affected libraries), Top-128 TF-IDF results and Top-128 results after BERT-FNN (output of VulLibMiner).

Now that we take the descriptions of library descriptions besides vulnerability descriptions as input, maven_corpus_new.json corresponds to the descriptions of all Java libraries in Maven (311,233).

An Java libary entry includes its name, description, tokens in its name, tokens in its description.

In VeraJava, VeraJava/cve_labels.csv and VeraJava/verajava.csv corresponds to the original dataset (including multiple programming languages) and the VeraJava dataset we extracted.

In Chronos, FastXML, LightXML, we include our baselines and their running scripts.

In data_scripts, we include our scripts for data-cleaning and generation for VulLibMiner and baselines.

In tf-idf-fig, we include our evaluation results of RQ4: the selection of hyper-parameters.