Python Scripts

These Python scripts use the GitHub APIs to gather data.

Acceptable Use

Note: Some of these scripts gather names and email addresses, which we use to find a contact when we have questions about a repository or org. The GitHub Acceptable Use Policies prohibit certain uses of this information. Please read the policies, and do not use scripts like these for unethical purposes.

Requirements:

The scripts share a few common requirements. Individual scripts may have additional requirements, which are described in each script's docstring.

  • pandas must be installed in the Python environment where you run these scripts.
  • Your API key should be stored in a file called gh_key in the same folder as these scripts.
  • Most scripts also require an orgs.txt or other text file used as input. Details can be found in the docstring for each script.
  • Most scripts require a folder named "output" in the same directory as the scripts. CSV output files are stored there.
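
The shared setup above can be sketched as a few small helpers. This is only an illustration, not code from the scripts themselves: the function names are hypothetical, and it assumes the input file lists one org per line (check each script's docstring for the actual format).

```python
from __future__ import annotations

from pathlib import Path


def read_api_key(script_dir: Path) -> str:
    """Return the API key stored in the gh_key file, stripped of whitespace."""
    return (script_dir / "gh_key").read_text(encoding="utf-8").strip()


def read_org_list(path: Path) -> list[str]:
    """Return the non-empty lines of an input file such as orgs.txt.

    Assumption: one org name per line.
    """
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def ensure_output_dir(script_dir: Path) -> Path:
    """Create the "output" folder if it is missing and return its path."""
    out = script_dir / "output"
    out.mkdir(exist_ok=True)
    return out
```

Creating the output folder on demand is a design choice in this sketch; the scripts as described expect the folder to exist already.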

Scripts

Inclusivity Check

This script uses the GitHub GraphQL API to retrieve the default branch name and code of conduct for each repo in a GitHub org, which gives a quick but rudimentary inclusivity check.
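
To make this concrete, here is a rough sketch of what such a query and its post-processing could look like. It is not taken from inclusivity_check.py: the query text, the `summarize_repo` helper, and the output keys are assumptions for illustration. The `defaultBranchRef` and `codeOfConduct` fields come from the GitHub GraphQL schema, and either one can be null (for example, in an empty repo or one with no code of conduct).

```python
# Hypothetical query; the real script may request different fields.
QUERY = """
query ($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        name
        defaultBranchRef { name }
        codeOfConduct { name url }
      }
    }
  }
}
"""


def summarize_repo(node: dict) -> dict:
    """Flatten one repository node, tolerating null branch or code of conduct."""
    branch = (node.get("defaultBranchRef") or {}).get("name")
    coc = (node.get("codeOfConduct") or {}).get("name")
    return {
        "repo": node["name"],
        "default_branch": branch,
        "code_of_conduct": coc,
    }
```

Rows where `default_branch` or `code_of_conduct` is `None` are the ones that would need a closer look.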

Running the script

Requires orgs.txt

$ python3 inclusivity_check.py

Repository Activity

These scripts demonstrate the difference in speed and rate limits between the GitHub REST API and the GraphQL API. The original REST script took hours to run across our 60+ GitHub orgs and had to be slowed down to avoid hitting the rate limit, while the GraphQL version, which gathers the same data, runs in less than 15 minutes without hitting any rate limits.
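
Part of the speedup plausibly comes from GraphQL returning a full page of repositories, with every requested field, in a single request, where REST often needs extra calls per repository. The sketch below shows cursor-based paging under that assumption. It is not the code from repo_activity.py: `iter_repos` and the injected `run_query` callable are hypothetical names, and `run_query` stands in for whatever function posts the query to the API and returns the `data` portion of the response.

```python
from typing import Callable, Iterator, Optional


def iter_repos(run_query: Callable[[dict], dict], org: str) -> Iterator[dict]:
    """Yield every repository node for an org, one page per request.

    run_query receives the query variables and must return a dict shaped like
    {"organization": {"repositories": {"nodes": [...], "pageInfo": {...}}}}.
    """
    cursor: Optional[str] = None
    while True:
        data = run_query({"org": org, "cursor": cursor})
        repos = data["organization"]["repositories"]
        yield from repos["nodes"]
        page_info = repos["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        # Ask for the page that starts after the last item we received.
        cursor = page_info["endCursor"]
```

Passing the request function in as a parameter keeps the paging logic testable without network access.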

scripts/repo_activity.py
scripts/repo_activity_coc.py
scripts/repo_activity_REST.py

We use this script to gather basic data about the repositories across the dozens of GitHub orgs that an organization owns. The data helps us understand whether projects are meeting our compliance requirements, and it helps us find abandoned repos that have outlived their usefulness and should be archived.
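
As one illustration of the "abandoned repos" use case, the gathered data could be filtered by last push date. This sketch is an assumption, not the script's logic: the two-year threshold and the `find_stale` helper are invented for the example, and it presumes each record carries the `name`, `pushedAt`, and `isArchived` fields that the GraphQL API exposes on a repository.

```python
from datetime import datetime, timedelta, timezone


def is_stale(pushed_at: str, now: datetime, max_age_days: int = 730) -> bool:
    """Return True if the last push is older than max_age_days.

    pushed_at is an ISO 8601 UTC timestamp such as "2021-03-04T05:06:07Z".
    """
    last_push = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ")
    last_push = last_push.replace(tzinfo=timezone.utc)
    return now - last_push > timedelta(days=max_age_days)


def find_stale(repos: list, now: datetime, max_age_days: int = 730) -> list:
    """Return names of repos that are not archived yet but look abandoned."""
    return [
        repo["name"]
        for repo in repos
        if not repo.get("isArchived")
        and is_stale(repo["pushedAt"], now, max_age_days)
    ]
```

A stale push date is only a starting signal; a repo can be finished and still in use, so the result is a list of candidates to review rather than a list to archive.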

Note: repo_activity_coc.py is mostly identical to repo_activity.py, but it adds info about the code of conduct. It is a separate script because the codeOfConduct object in the GraphQL API is a bit problematic and tends to time out even when retrieving relatively small amounts of data.

Running the scripts

Requires orgs.txt

$ python3 repo_activity.py

Sunset

This script uses the GitHub GraphQL API to gather data that helps determine whether a repo can be archived. It retrieves relevant information about a repository, including its forks, to help determine ownership and to identify people you might contact to understand how they are using the project.

As input, this script requires a GitHub URL for a repository or a csv file containing one repo_name,org_name pair per line.
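
The two input forms could be normalized along these lines. This is a sketch rather than the parsing code in sunset.py: both helper names are hypothetical, and it assumes the csv file has no header row. Note that the csv lists the repo first and the org second, which is the reverse of the order in the URL.

```python
from __future__ import annotations

import csv
from typing import Iterable
from urllib.parse import urlparse


def parse_repo_url(url: str) -> tuple[str, str]:
    """Split https://github.com/org_name/repo_name into (org_name, repo_name)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected a URL ending in org_name/repo_name: {url}")
    return parts[0], parts[1]


def read_repo_csv(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Read repo_name,org_name rows and return (org_name, repo_name) pairs."""
    pairs = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        repo_name, org_name = row[0].strip(), row[1].strip()
        pairs.append((org_name, repo_name))
    return pairs
```

`read_repo_csv` accepts any iterable of lines, so an open file object can be passed to it directly.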

Running the script

Run the script with one repo URL as input:

$ python3 sunset.py -u "https://github.com/org_name/repo_name"

Run the script with a csv file containing one repo_name,org_name pair per line:

$ python3 sunset.py -f sunset.csv

Monitoring

This script uses the GitHub GraphQL API to retrieve the pinned repos for each GitHub org listed in monitoring.txt and runs criticality score for each of those pinned repositories.

Running the script

Requires monitoring.txt

$ python3 monitoring.py

Keyword by Repo with Optional Filter

The keyword_by_repo script uses the GitHub GraphQL API to retrieve relevant information about repositories mentioning certain keywords.

The filter_keyword_by_org script takes the results of a keyword search and filters them based on a list of GitHub organizations.

As input, the filter_keyword_by_org script requires a file generated by the keyword_by_repo.py script. The file path is provided as a command line argument.
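
The filtering step might look roughly like the sketch below. It is not the code from filter_keyword_by_org.py: the `filter_rows_by_org` helper and the `org` column name are assumptions, since the actual column names depend on what keyword_by_repo.py writes. The comparison ignores case because GitHub org names are case-insensitive.

```python
import csv
import io


def filter_rows_by_org(rows, orgs, org_column="org"):
    """Keep only the rows whose org column matches one of the given orgs.

    rows: an iterable of dicts, e.g. from csv.DictReader.
    org_column: hypothetical column name; adjust to the real csv header.
    """
    wanted = {org.lower() for org in orgs}
    return [row for row in rows if row.get(org_column, "").lower() in wanted]


# Example with in-memory data standing in for a keyword search output file.
sample = io.StringIO("org,repo\nAcme,tool\nother,lib\nacme,docs\n")
matches = filter_rows_by_org(csv.DictReader(sample), ["acme"])
print([row["repo"] for row in matches])
```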

Running the scripts

keyword_by_repo requires keywords.txt

$ python3 keyword_by_repo.py

filter_keyword_by_org requires output file from keyword_by_repo

$ python3 filter_keyword_by_org.py /path/to/keyword_search_2022-07-26.csv

Mystery GitHub Organizations

We can use this script to gather basic data about GitHub orgs that we believe may have been created outside of our process by various employees across our business units. We gather the first few members of the org to help identify employees who can provide more details about the purpose of the org and how it is used.

scripts/mystery_orgs.py

However, since org members are private by default, this script may be less useful than simply running repo_activity.py on the same orgs, which also tells you more about the repos and yields better contact info from the commit data.

Running the script

Requires orgs.txt

$ python3 mystery_orgs.py