devsearch-reporank

PageRank adaptation for GitHub repositories

How it works

We use an adaptation of Google's PageRank in which the random walker follows "User Stars Repo" and "Repo Has Contributors" relations.

BigQuery and GithubArchive

There is a very large number of "User Stars Repo" and "Repo Has Contributors" relations, and crawling them from the GitHub API with normal access tokens is nearly impossible. We therefore fetch them from GitHub Archive, which stores every GitHub event since 01.01.2011. Among those events, we are interested in Watch Events (emitted when a user stars a repository) and Push Events (which identify a repository's contributors).

GitHub Archive exposes its data through Google BigQuery, where we can run large distributed queries.

PageRank Adaptation

Each iteration consists of two steps: spreading user notoriety to repositories, then spreading repository notoriety back to users.
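The two-step iteration can be illustrated with a minimal plain-Python sketch. This is not the repository's Spark implementation: the function names, the choice to spread along star edges in the first step and contributor edges in the second, the normalization by each node's total outgoing weight, and the absence of a damping factor are all assumptions made for illustration.

```python
from collections import defaultdict

def spread(scores, edges):
    """Push each source's score to its targets, proportionally to edge weight.

    scores: dict mapping source -> score
    edges:  list of (source, target, weight) tuples
    """
    out_weight = defaultdict(float)
    for src, _, w in edges:
        out_weight[src] += w
    result = defaultdict(float)
    for src, dst, w in edges:
        result[dst] += scores.get(src, 0.0) * w / out_weight[src]
    return dict(result)

def iterate(user_scores, star_edges, contrib_edges):
    """One iteration: users -> repos (assumed: via stars),
    then repos -> users (assumed: via contributions)."""
    repo_scores = spread(user_scores, star_edges)
    repo_to_user = [(repo, login, nb) for login, repo, nb in contrib_edges]
    new_user_scores = spread(repo_scores, repo_to_user)
    return new_user_scores, repo_scores

# Hypothetical toy data in the (login, repo, nb) shape produced by the queries.
stars = [("alice", "a/x", 1), ("alice", "b/y", 1), ("bob", "a/x", 1)]
contribs = [("carol", "a/x", 3), ("bob", "a/x", 1), ("alice", "b/y", 2)]
users = {"alice": 0.5, "bob": 0.5}
users, repos = iterate(users, stars, contribs)
```

With this normalization the total score is conserved as long as every scored node has at least one outgoing edge. A real run would repeat `iterate` until the scores stop changing.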

Usage

BigQuery and GithubArchive

Go to BigQuery
Run each of the following queries (two separate queries for events before and since 2015, plus a third that merges them)
Store the result in a bucket on Google Cloud Storage
Download the data locally as CSV
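Once downloaded, the CSV can be read into (login, repo, nb) tuples. The sketch below assumes the export has a header row and three columns in that order. The function name and the sample data are hypothetical.

```python
import csv
import io

def read_edges(lines):
    """Parse CSV rows into (login, repo, nb) tuples, skipping the header."""
    reader = csv.reader(lines)
    next(reader, None)  # assumed: the first row is a header
    return [(login, repo, int(nb)) for login, repo, nb in reader]

# In-memory sample standing in for a downloaded file.
sample = io.StringIO("login,repo,nb\nalice,a/x,1\nbob,a/x,2\n")
edges = read_edges(sample)
```

For a real export, pass an open file instead, for example `open("stars.csv", newline="")`, where the file name is a placeholder.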

Watch Events

For events before 01.01.2015: (save to githubstars.stars_11_14)

SELECT
  actor_attributes_login as login, REGEXP_EXTRACT(repository_url,r'\/([\w\d_\-\.]+\/[\w\d_\-\.]+)$') as repo, COUNT(*) as nb
FROM
  TABLE_QUERY([githubarchive:month],
    'REGEXP_MATCH(table_id, r"^201[1234]\d\d$")'
  )
WHERE
  type = "WatchEvent"
GROUP EACH BY
  login, repo
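To see what the REGEXP_EXTRACT call keeps, the same pattern can be tried locally. This sketch uses Python's `re` module rather than BigQuery's regex engine, which is assumed to behave the same way for this pattern. The helper name is hypothetical.

```python
import re

# Same pattern as in the query: capture the trailing "owner/name" of a URL.
REPO_PATTERN = re.compile(r'\/([\w\d_\-\.]+\/[\w\d_\-\.]+)$')

def extract_repo(repository_url):
    """Return the "owner/name" suffix of a repository URL, or None."""
    match = REPO_PATTERN.search(repository_url)
    return match.group(1) if match else None

repo = extract_repo("https://github.com/devsearch-epfl/devsearch-reporank")
no_repo = extract_repo("https://github.com/devsearch-epfl/devsearch-reporank/")
```

Here `repo` is `"devsearch-epfl/devsearch-reporank"`, while the URL with a trailing slash yields `None` because the pattern is anchored at the end of the string.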

For events since 01.01.2015: (save to githubstars.stars_15)

SELECT
  actor_login as login, repo_name as repo, COUNT(*) as nb
FROM
  TABLE_QUERY([githubarchive:month],
    'REGEXP_MATCH(table_id, r"^2015\d\d")'
  )
WHERE
  type = "WatchEvent"
GROUP EACH BY
  login, repo
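The TABLE_QUERY filters select monthly tables by their id. A quick local check, assuming table ids follow a YYYYMM format and that REGEXP_MATCH performs an unanchored search, shows how the tables split between the two periods. The list of ids is made up for illustration.

```python
import re

BEFORE_2015 = re.compile(r"^201[1234]\d\d$")
SINCE_2015 = re.compile(r"^2015\d\d")

# Hypothetical sample of monthly table ids.
table_ids = ["201012", "201101", "201412", "201501", "201512", "201601"]

before = [t for t in table_ids if BEFORE_2015.search(t)]
since = [t for t in table_ids if SINCE_2015.search(t)]
```

With this sample, `before` holds the 2011 to 2014 ids and `since` holds the 2015 ids, so no table is counted in both periods.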

Merging both (in BigQuery legacy SQL, a comma between tables in FROM is a UNION ALL, not a join):

SELECT
  login, repo, SUM(nb)
FROM
  githubstars.stars_11_14, githubstars.stars_15
GROUP EACH BY
  login, repo
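The merge amounts to concatenating both result sets and summing nb per (login, repo) pair. A minimal local equivalent is sketched below. The function name and sample rows are hypothetical.

```python
from collections import Counter

def merge_counts(*tables):
    """Concatenate (login, repo, nb) rows and sum nb per (login, repo)."""
    totals = Counter()
    for rows in tables:
        for login, repo, nb in rows:
            totals[(login, repo)] += nb
    return [(login, repo, nb) for (login, repo), nb in sorted(totals.items())]

# Hypothetical rows from the two periods.
stars_11_14 = [("alice", "a/x", 1), ("bob", "a/x", 2)]
stars_15 = [("alice", "a/x", 3), ("carol", "b/y", 1)]
merged = merge_counts(stars_11_14, stars_15)
```

A pair that appears in both periods, such as alice and a/x here, ends up as a single row with the combined count.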

Push Events

For events before 01.01.2015: (save to githubcontribs.contribs_11_14)

SELECT
  actor as login, REGEXP_EXTRACT(repository_url,r'\/([\w\d_\-\.]+\/[\w\d_\-\.]+)$') as repo, COUNT(*) as nb
FROM
  TABLE_QUERY([githubarchive:month],
    'REGEXP_MATCH(table_id, r"^201[1234]\d\d$")'
  )
WHERE
  type = "PushEvent"
GROUP EACH BY
  login, repo

For events since 01.01.2015: (save to githubcontribs.contribs_15)

SELECT
  actor_login as login, repo_name as repo, COUNT(*) as nb
FROM
  TABLE_QUERY([githubarchive:month],
    'REGEXP_MATCH(table_id, r"^2015\d\d")'
  )
WHERE
  type = "PushEvent"
GROUP EACH BY
  login, repo

Merging the two:

SELECT
  login, repo, SUM(nb)
FROM
  githubcontribs.contribs_11_14, githubcontribs.contribs_15
GROUP EACH BY
  login, repo

Spark

Node.js crawler

This is a prototype of a crawler for the GitHub API. It works but is constrained by API rate limits (5k calls per hour for an authenticated token). To run it, you need to:

Have Node.js installed
Replace the placeholders with your API token in crawler.js
Run node star-crawler/main.js

