Run this project with Docker using docker-compose (an excellent tutorial on understanding Docker Compose can be found here). Open a terminal inside the repository and run:

```
docker-compose up
```

This will start 2 containers: one runs the app and the other hosts the Redis storage. Run `docker ps` to see the running containers, then attach a shell to the app container:

```
docker exec -it github-commit-crawler_app_1 bash
```
where `github-commit-crawler_app_1` is the name of the container, which can differ on another machine, so make sure the name is correct using the `docker ps` command.

- Once inside the container, run `cd src`.
- Make sure the Tor service is running with `service tor status`; if it is not running, run `service tor start`.
- To run the Python files:

  ```
  python3 name_of_file.py
  ```
This will crawl reported vulnerable libraries from the GitHub Advisory Database and write them to files (CSV & JSON). We write JSON because it is more convenient to load back and use further, but we will need the CSV for data analysis.
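As a sketch of that dual-format output (the field names below are assumptions for illustration, not the crawler's actual schema), the advisories can be written to both formats like this:

```python
# Minimal sketch: persist crawled advisories to both JSON (easy to load
# back into Python) and CSV (handier for data analysis).
# The field names are hypothetical.
import csv
import json

FIELDS = ["library", "severity", "vulnerable_versions"]  # assumed schema

def write_advisories(advisories, csv_path, json_path):
    # JSON round-trips cleanly back into Python objects
    with open(json_path, "w") as f:
        json.dump(advisories, f, indent=2)
    # CSV for spreadsheet/pandas-style analysis
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(advisories)

def load_advisories(json_path):
    with open(json_path) as f:
        return json.load(f)
```

The JSON file is what later steps load back; the CSV is only consumed by the analysis.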
This will read back the collected vulnerable libraries and use GitHub's code search API to search the whole of GitHub for any usage of those libraries. It will write the information back to CSV files. The code search API has a few limitations:

- We only get at most 1000 matching results.
- The API is very simple, so no advanced searching can be done.
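A rough sketch of how a search capped at 1000 results can be paged with the standard library (the endpoint and parameters follow GitHub's public REST docs; the query string, token handling, and lack of rate-limit backoff are simplifications):

```python
# Sketch: page through GitHub's code search API, stopping at the
# 1000-result cap the API imposes. Error handling and rate-limit
# backoff are omitted for brevity.
import json
import urllib.parse
import urllib.request

MAX_RESULTS = 1000   # hard cap imposed by the search API
PER_PAGE = 100       # maximum page size the API allows

def pages_needed(total_count):
    """How many pages we can actually fetch, given the cap."""
    capped = min(total_count, MAX_RESULTS)
    return -(-capped // PER_PAGE)  # ceiling division

def search_code(query, token):
    results, page = [], 1
    while len(results) < MAX_RESULTS:
        params = urllib.parse.urlencode(
            {"q": query, "per_page": PER_PAGE, "page": page})
        req = urllib.request.Request(
            f"https://api.github.com/search/code?{params}",
            headers={"Authorization": f"token {token}",
                     "Accept": "application/vnd.github+json"})
        with urllib.request.urlopen(req) as resp:
            items = json.load(resp).get("items", [])
        if not items:  # ran out of results before hitting the cap
            break
        results.extend(items)
        page += 1
    return results[:MAX_RESULTS]
```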
```
nohup python3 -u data_collector.py > ../data/data_collector_out.txt 2>&1 &
```
- `nohup` lets you run a process in the background so that, if you close the terminal, the process keeps running.
- Make sure you take note of the process id that the `nohup` command returns. You can also find the process id with `ps aux | grep data_collector`.
- To kill the process: `kill {process id}`, for example `kill 472`.
- The `-u` flag tells Python to flush prints immediately instead of storing them in a buffer.
- `>` redirects output to the specified file, and `2>&1` redirects standard error to the same place as standard output, so both end up in the same file.
- Finally, do not forget the `&` at the very end; it tells the shell to run the command in the background and return you to the prompt immediately.
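If you want to check from Python whether the background process is still alive (a convenience sketch, not part of the project's code), signal 0 does an existence check without actually delivering a signal:

```python
# Sketch: test whether a PID (e.g. the one nohup printed) still refers
# to a live process. POSIX-only, which matches the Linux container the
# project runs in.
import os

def is_running(pid):
    try:
        os.kill(pid, 0)  # signal 0: check only, sends nothing
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # exists, but owned by another user
    return True
```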
Data are collected in text files and Redis. The text files contain:

- `data/npm_advisories` (CSV & JSON): both contain reported advisories from the GitHub Advisory Database.
- `data/vulnerabilities.csv`: contains usages of the libraries found in the advisory database; they may or may not use a vulnerable version of the library.
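For the analysis side, the CSV can be read back with the standard library; the `library` column name below is an assumption about the file's actual schema:

```python
# Sketch: tally usage rows per library from vulnerabilities.csv.
# The "library" column name is hypothetical; adjust to the real header.
import csv
from collections import Counter

def count_by_library(csv_path):
    with open(csv_path, newline="") as f:
        return Counter(row["library"] for row in csv.DictReader(f))
```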
We have used Tor to mask the IP; read more about the process here.