GitHub vulnerability data collector

Run this project via Docker using docker-compose. An excellent tutorial on understanding Docker Compose can be found here.

Example:

Open a terminal inside the repository, then run:

  1. docker-compose up. This will start two containers: one runs the app and the other hosts the Redis storage (a sketch of a compose file follows this list).
  2. docker ps to see the running containers.
  3. docker exec -it github-commit-crawler_app_1 bash, where github-commit-crawler_app_1 is the name of the container. It can be different on another machine, so make sure the name is correct using the docker ps command.
  4. Once inside the container, cd src.
  5. Make sure the Tor service is running with service tor status; if it is not running, run service tor start.
  6. To run the Python files: python3 name_of_file.py
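
For reference, here is a minimal sketch of what such a compose file might look like. The service names (app, redis), build context, and volume mount are assumptions, not the repository's actual docker-compose.yml:

    # Hypothetical compose file sketch; the real one in this repo may differ.
    version: "3"
    services:
      app:
        build: ./app            # assumed: the app image is built from ./app
        volumes:
          - ./app/src:/src      # assumed: source mounted into the container
        depends_on:
          - redis
      redis:
        image: redis            # hosts the Redis storage used by the crawler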

app/src/crawl_advisories.py

This will crawl reported vulnerable libraries from the GitHub Advisory database and write them to files (CSV and JSON). We write JSON because it is more convenient to load back for further use, and we need the CSV for data analysis.
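
A minimal sketch of the write-out step described above; the record fields are assumptions, not the script's actual schema:

    import csv
    import json

    # Hypothetical advisory records; the real script collects these from
    # the GitHub Advisory database (the field names here are assumptions).
    advisories = [
        {"package": "lodash", "vulnerable_versions": "<4.17.12", "severity": "HIGH"},
    ]

    # JSON is convenient to load straight back into Python objects ...
    with open("../data/npm_advisories.json", "w") as f:
        json.dump(advisories, f, indent=2)

    # ... while the CSV feeds the data-analysis step.
    with open("../data/npm_advisories.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=advisories[0].keys())
        writer.writeheader()
        writer.writerows(advisories)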

app/src/data_collector.py

This will read back the collected vulnerable libraries and use GitHub's code search API to search all of GitHub for any usage of those libraries, writing the information back to CSV files. The code search API has a few limitations (a search-loop sketch follows the list):

  1. We only get the first 1,000 matching results.
  2. The API is very simple, so no advanced searching can be done.
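
A sketch of what such a search loop can look like against GitHub's code search REST endpoint. The token, query string, and function name are assumptions; note how the 1,000-result cap shows up as at most 10 pages of 100 results each:

    import requests

    TOKEN = "ghp_..."  # assumed: a personal access token (code search requires auth)
    HEADERS = {
        "Authorization": f"token {TOKEN}",
        "Accept": "application/vnd.github.v3+json",
    }

    def search_code_usages(library):
        """Collect code search hits for one library, capped at 1,000 results."""
        results = []
        # The API returns at most 1,000 results: 10 pages of 100 per page.
        for page in range(1, 11):
            resp = requests.get(
                "https://api.github.com/search/code",
                headers=HEADERS,
                params={
                    "q": f'"{library}" in:file filename:package.json',
                    "per_page": 100,
                    "page": page,
                },
            )
            resp.raise_for_status()
            items = resp.json().get("items", [])
            if not items:
                break
            results.extend(items)
        return results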
Running with console output redirected to a file

nohup python3 -u data_collector.py > ../data/data_collector_out.txt 2>&1 &

  1. nohup lets you run a process in the background, so the process keeps running even if you close the terminal.
  2. Make sure to note the process id that the nohup command returns. You can also find the process id with ps aux | grep data_collector.
  3. To kill the process: kill {process id}, for example kill 472.
  4. The -u flag tells Python to flush prints immediately instead of storing them in a buffer.
  5. > redirects output to the specified file, and 2>&1 redirects standard error to the same place as standard output.
  6. Finally, do not forget the & at the very end; it tells the shell to run the command in the background and give the prompt back to you. The complete lifecycle is sketched after this list.
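
Putting the pieces together, the whole lifecycle looks roughly like this (the process id 472 is just an example):

    # Start the collector in the background, logging to a file.
    nohup python3 -u data_collector.py > ../data/data_collector_out.txt 2>&1 &

    # Find the process id if you did not note it down.
    ps aux | grep data_collector

    # Follow the log while the collector runs.
    tail -f ../data/data_collector_out.txt

    # Stop the collector (replace 472 with the actual process id).
    kill 472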

Data collection

Data are collected in text files and Redis. The text files are listed below; a short sketch of loading them back follows the list:

  1. data/npm_advisories (csv & json): both contain the reported advisories from the GitHub Advisory database.
  2. data/vulnerabilities.csv contains usages of the libraries found in the advisory database; they may or may not use a vulnerable version of the library.
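
For example, loading the collected files back for analysis might look like this (the exact file names under data/ are taken from the list above):

    import csv
    import json

    # The JSON advisories load straight back into Python objects.
    with open("data/npm_advisories.json") as f:
        advisories = json.load(f)

    # The usage data is read row by row from the CSV.
    with open("data/vulnerabilities.csv", newline="") as f:
        usages = list(csv.DictReader(f))

    print(f"{len(advisories)} advisories, {len(usages)} usage records")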

IP masking

We have used Tor to mask our IP; read more about the process here.
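
A common way to route Python requests through Tor is via its local SOCKS proxy (port 9050 by default). This is a sketch of the general technique, not necessarily the exact setup used in this project, and it assumes the requests[socks] extra (PySocks) is installed:

    import requests

    # Tor exposes a SOCKS5 proxy on localhost:9050 by default; the "socks5h"
    # scheme makes DNS resolution go through Tor as well.
    TOR_PROXIES = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    # Requests sent through the proxy appear to come from a Tor exit node.
    resp = requests.get("https://api.github.com/rate_limit", proxies=TOR_PROXIES)
    print(resp.status_code)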
