Data 512 Assignment 1: Data Curation
University of Washington DATA 512
Instructors Morgan and Keyes
For this assignment, I obtained site traffic data from the Wikimedia APIs, then processed and analyzed it following open research best practices. The resulting data and graph files are delivered in this repository.
The source data is licensed under CC0 1.0, and its use should follow the Wikimedia terms and conditions of use. The code in this repository is MIT licensed.
The Legacy Pagecounts API provides access to desktop and mobile traffic data from December 2007 through July 2016. The Pageviews API provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.
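As a rough illustration of how the two APIs might be queried, the sketch below builds one request URL for each. The endpoint templates follow the Wikimedia REST API's aggregate routes as I understand them. The date ranges, the `agent="user"` setting, the `build_url` and `fetch_json` helper names, and the User-Agent string are assumptions made for this example and are not necessarily what the notebook uses.

```python
import json
import urllib.request

# Endpoint templates for the Wikimedia REST API aggregate routes.
LEGACY_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def build_url(template, **params):
    """Fill an endpoint template with request parameters (hypothetical helper)."""
    return template.format(**params)


def fetch_json(url, user_agent="data512-a1-example"):
    """Download and decode one API response (hypothetical helper).

    The User-Agent value is a placeholder; replace it with your own contact info.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Legacy Pagecounts: desktop traffic, monthly, timestamps in YYYYMMDDHH form.
pagecounts_url = build_url(
    LEGACY_ENDPOINT,
    project="en.wikipedia.org",
    access="desktop-site",
    granularity="monthly",
    start="2007120100",  # assumed start of the range
    end="2016080100",    # assumed end of the range
)

# Pageviews: desktop traffic; agent="user" is an assumed choice for this sketch.
pageviews_url = build_url(
    PAGEVIEWS_ENDPOINT,
    project="en.wikipedia.org",
    access="desktop",
    agent="user",
    granularity="monthly",
    start="2015070100",  # assumed start of the range
    end="2018100100",    # assumed end of the range
)
```

Calling `fetch_json(pageviews_url)` would then return the decoded response, which could be saved with `json.dump` to a file such as those listed in this repository.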
The files in this repository are:
File | Description |
---|---|
README.md | Overview of project. Start here! |
LICENSE | MIT to cover the files in this repo |
A1 Data Curation.IPYNB | Jupyter Notebook containing code and steps to execute this project. |
pagecounts_desktop-site_200807-201607.JSON | Response data from Pagecounts for desktop through July 2016 |
pagecounts_mobile-site_200807-201607.JSON | Response data from Pagecounts for mobile through July 2016 |
pageviews_desktop_200807-201809.JSON | Response data from Pageviews for desktop through Sep 2018 |
pageviews_mobile-app_200807-201809.JSON | Response data from Pageviews for mobile app through Sep 2018 |
pageviews_mobile-web_200807-201809.JSON | Response data from Pageviews for mobile web through Sep 2018 |
en-wikipedia_traffic_200712-201809.CSV | Final output combined data used for plotting |
en-wikipedia_traffic_200712-201809.PNG | Final output plot |
The schema of en-wikipedia_traffic_200712-201809.CSV is:
Column | Value |
---|---|
year | YYYY |
month | MM |
pagecount_all_views | num_views |
pagecount_desktop_views | num_views |
pagecount_mobile_views | num_views |
pageview_all_views | num_views |
pageview_desktop_views | num_views |
pageview_mobile_views | num_views |
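The sketch below shows one way the five API responses could be combined into rows matching this schema. It rests on several assumptions that the notebook may handle differently: each response item carries a `timestamp` in YYYYMMDDHH form plus either a `views` field (Pageviews) or a `count` field (Pagecounts), mobile app and mobile web views are summed into `pageview_mobile_views`, each "all" column is the sum of its desktop and mobile columns, and months with no data for a source are recorded as 0. The function names are hypothetical.

```python
import csv
from collections import defaultdict

COLUMNS = [
    "year", "month",
    "pagecount_all_views", "pagecount_desktop_views", "pagecount_mobile_views",
    "pageview_all_views", "pageview_desktop_views", "pageview_mobile_views",
]


def monthly_totals(items):
    """Sum traffic per (year, month) from a list of API response items."""
    totals = defaultdict(int)
    for item in items:
        timestamp = item["timestamp"]
        # Assumed field names: Pageviews items use "views", Pagecounts use "count".
        value = item["views"] if "views" in item else item["count"]
        totals[(timestamp[:4], timestamp[4:6])] += value
    return totals


def combine(pc_desktop, pc_mobile, pv_desktop, pv_mobile_web, pv_mobile_app):
    """Build one row per month, with 0 where a source has no data (assumption)."""
    pc_d, pc_m = monthly_totals(pc_desktop), monthly_totals(pc_mobile)
    pv_d = monthly_totals(pv_desktop)
    # Mobile web and mobile app are merged into a single mobile column.
    pv_m = monthly_totals(pv_mobile_web + pv_mobile_app)
    months = sorted(set(pc_d) | set(pc_m) | set(pv_d) | set(pv_m))
    rows = []
    for year, month in months:
        key = (year, month)
        rows.append({
            "year": year,
            "month": month,
            "pagecount_all_views": pc_d.get(key, 0) + pc_m.get(key, 0),
            "pagecount_desktop_views": pc_d.get(key, 0),
            "pagecount_mobile_views": pc_m.get(key, 0),
            "pageview_all_views": pv_d.get(key, 0) + pv_m.get(key, 0),
            "pageview_desktop_views": pv_d.get(key, 0),
            "pageview_mobile_views": pv_m.get(key, 0),
        })
    return rows


def write_csv(rows, handle):
    """Write combined rows to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

With the JSON files loaded, `combine` would be called on the `items` list from each response, and `write_csv` on a file opened with `newline=""`.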
To reproduce and expand upon this work:
- Git clone this repo.
- Run A1 Data Curation.IPYNB as a Python Jupyter Notebook.
Note that Pagecounts, the original (legacy) traffic API, included hits from web crawlers and other automated traffic. The newer Pageviews API should exclude such hits.