Run this project with Docker using docker-compose (an excellent tutorial on understanding Docker Compose can be found here). Open a terminal inside the repository and run:

```
docker-compose up
```

This will start 2 containers: one runs the app and the other hosts the Redis storage. Run `docker ps` to see the running containers, then attach a shell to the app container:

```
docker exec -it github-commit-crawler_app_1 bash
```
where `github-commit-crawler_app_1` is the name of the container, which can differ on another machine, so make sure the name is correct using the `docker ps` command.

- Once inside the container, run `cd src`.
- Make sure the Tor service is running with `service tor status`; if it is not running, run `service tor start`.
- To run the Python files:

  ```
  python3 name_of_file.py
  ```
This will crawl reported vulnerable libraries from the GitHub Advisory Database and write them to files (CSV & JSON). We write JSON because it is more convenient to load back and use further, but we will need the CSV for data analysis.
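As a sketch of that dual-format output (the field names below are assumptions for illustration, not the crawler's actual schema), the advisories can be written to both formats like this:

```python
# Minimal sketch: persist crawled advisories to both JSON (easy to load
# back into Python) and CSV (handier for data analysis).
# The field names are hypothetical.
import csv
import json

FIELDS = ["library", "severity", "vulnerable_versions"]  # assumed schema

def write_advisories(advisories, csv_path, json_path):
    # JSON round-trips cleanly back into Python objects
    with open(json_path, "w") as f:
        json.dump(advisories, f, indent=2)
    # CSV for spreadsheet/pandas-style analysis
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(advisories)

def load_advisories(json_path):
    with open(json_path) as f:
        return json.load(f)
```

The JSON file is what later steps load back; the CSV is only consumed by the analysis.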
This will read back the collected vulnerable libraries and use GitHub's code search API to search the whole of GitHub for any usage of those libraries. It will write the information back to CSV files. The code search API has a few limitations:

- We only get at most 1000 matching results.
- The API is very simple, so no advanced searching can be done.
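A rough sketch of how a search capped at 1000 results can be paged with the standard library (the endpoint and parameters follow GitHub's public REST docs; the query string, token handling, and lack of rate-limit backoff are simplifications):

```python
# Sketch: page through GitHub's code search API, stopping at the
# 1000-result cap the API imposes. Error handling and rate-limit
# backoff are omitted for brevity.
import json
import urllib.parse
import urllib.request

MAX_RESULTS = 1000   # hard cap imposed by the search API
PER_PAGE = 100       # maximum page size the API allows

def pages_needed(total_count):
    """How many pages we can actually fetch, given the cap."""
    capped = min(total_count, MAX_RESULTS)
    return -(-capped // PER_PAGE)  # ceiling division

def search_code(query, token):
    results, page = [], 1
    while len(results) < MAX_RESULTS:
        params = urllib.parse.urlencode(
            {"q": query, "per_page": PER_PAGE, "page": page})
        req = urllib.request.Request(
            f"https://api.github.com/search/code?{params}",
            headers={"Authorization": f"token {token}",
                     "Accept": "application/vnd.github+json"})
        with urllib.request.urlopen(req) as resp:
            items = json.load(resp).get("items", [])
        if not items:  # ran out of results before hitting the cap
            break
        results.extend(items)
        page += 1
    return results[:MAX_RESULTS]
```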
```
nohup python3 -u data_collector.py > ../data/data_collector_out.txt 2>&1 &
```
- `nohup` lets you run a process in the background so that, if you close the terminal, the process keeps running.
- Make sure you take note of the process id that the `nohup` command returns. You can also find the process id with `ps aux | grep data_collector`.
- To kill the process: `kill {process id}`, for example `kill 472`.
- The `-u` flag tells Python to flush prints immediately instead of storing them in a buffer.
- `>` redirects output to the specified file, and `2>&1` redirects standard error to the same place as standard output, so both end up in the same file.
- Finally, do not forget the `&` at the very end; it tells the shell to run the command in the background and return you to the prompt immediately.
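If you want to check from Python whether the background process is still alive (a convenience sketch, not part of the project's code), signal 0 does an existence check without actually delivering a signal:

```python
# Sketch: test whether a PID (e.g. the one nohup printed) still refers
# to a live process. POSIX-only, which matches the Linux container the
# project runs in.
import os

def is_running(pid):
    try:
        os.kill(pid, 0)  # signal 0: check only, sends nothing
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # exists, but owned by another user
    return True
```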
Data are collected in text files and Redis. The text files contain:

- `data/npm_advisories` (CSV & JSON): both contain reported advisories from the GitHub Advisory Database.
- `data/vulnerabilities.csv`: contains usages of the libraries found in the advisory database; they may or may not use a vulnerable version of the library.
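For the analysis side, the CSV can be read back with the standard library; the `library` column name below is an assumption about the file's actual schema:

```python
# Sketch: tally usage rows per library from vulnerabilities.csv.
# The "library" column name is hypothetical; adjust to the real header.
import csv
from collections import Counter

def count_by_library(csv_path):
    with open(csv_path, newline="") as f:
        return Counter(row["library"] for row in csv.DictReader(f))
```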
We have used Tor to mask the IP; read more about the process here.