This repository contains the code and data used to analyze the gender breakdown of owners of public GitHub repositories. I wrote a blog post about what I found.
To reproduce the analysis, run the scripts in the following order:
get_github_info_byday.py
: uses the GitHub API to scrape repository data. (NB: this will take something like 60 hours to run.)
merge_files.sh
: puts all scraped data together in a big text file
make_database.R
: dumps the scraped data into a SQLite database
analyze_data.py
: processes the data
bargraph.js
: JavaScript/D3 code used to make the graphic showing the results. Alex Wilson made major contributions to this code.
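The run order above can be sketched as a small Python driver. This is an illustration under assumptions, not the author's workflow: the interpreter for each step (`python`, `bash`, `Rscript`) and the argument-free invocations are guesses, and the scripts may expect arguments, credentials, or a specific Python version that this README does not describe. bargraph.js is left out because it is browser-side D3 code rather than a pipeline step.

```python
import subprocess

# Assumed interpreters and argument-free invocations; the README does not
# say how each script is launched.
PIPELINE = [
    ["python", "get_github_info_byday.py"],  # scraping step, very slow per the README
    ["bash", "merge_files.sh"],
    ["Rscript", "make_database.R"],
    ["python", "analyze_data.py"],
]


def run_pipeline(dry_run=True):
    """Go through the steps in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    Returns the list of commands that were printed or run.
    """
    done = []
    for cmd in PIPELINE:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The dry run is the default because the first step is reported to take around 60 hours; pass `dry_run=False` only once the commands look right for your setup.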
The data I scraped in get_github_info_byday.py and processed with merge_files.sh and make_database.R is available in a .db file here. I removed all repo owner last names.
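If you start from the published .db file rather than rescraping, a quick way to see what it contains is to list its tables with Python's built-in `sqlite3` module. The sketch below makes no assumption about the table names, since the schema is not documented here; `list_tables` is a hypothetical helper written for this example.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are quoted because they come from the file, not from us.
        return {
            name: conn.execute('SELECT COUNT(*) FROM "{}"'.format(name)).fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

For example, `list_tables("github_data.db")` would return the tables and their row counts, where `github_data.db` is a placeholder for whatever name the downloaded file has. Note that `sqlite3.connect` creates an empty database if the path does not exist, so an empty result usually means a wrong path.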
Python libraries: PyGithub, Unidecode, Pandas, SexMachine, Matplotlib. Make sure these are installed before running scripts. See requirements.txt for a more detailed specification of Python dependencies, including versions.
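One way to confirm the libraries are present before starting a long run is to check whether each module can be found. This is a minimal sketch: the import names in `REQUIRED` are the module names these libraries are normally imported under, `missing_packages` is a hypothetical helper, and `importlib.util.find_spec` requires Python 3.4 or later, which may not match the interpreter the original scripts target. It checks presence only, not the versions pinned in requirements.txt.

```python
import importlib.util

# Library name -> import name (assumed mapping for the libraries listed above).
REQUIRED = {
    "PyGithub": "github",
    "Unidecode": "unidecode",
    "Pandas": "pandas",
    "SexMachine": "sexmachine",
    "Matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the names of listed libraries whose module cannot be found."""
    return [
        pkg
        for pkg, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing: " + ", ".join(missing) if missing else "All listed libraries found.")
```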
R packages: devtools, proto, DBI, chron, RSQLite, and RSQLite.extfuns. All can be installed from CRAN in R using the install.packages function.