This project has two goals:
- Predicting the 2016 US election results by county with supervised machine learning in R.
- Mining interesting association rules that relate to demographics and voting preference in R.
Three supervised machine learning models are used to predict election results from demographics: K-Nearest Neighbors, Decision Trees, and Artificial Neural Networks. The models are compared on accuracy and precision.
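As a rough illustration of the two comparison metrics named above, the sketch below computes accuracy and precision from a confusion matrix in base R. The label vectors are made-up toy values, and the helper name `classification_metrics` and the choice of "Democrat" as the positive class are assumptions for illustration, not taken from the project's code.

```r
# Toy labels for illustration only (not project data).
actual    <- factor(c("Democrat", "Republican", "Republican", "Democrat",
                      "Republican", "Republican", "Democrat", "Republican"),
                    levels = c("Democrat", "Republican"))
predicted <- factor(c("Democrat", "Republican", "Democrat", "Democrat",
                      "Republican", "Republican", "Republican", "Republican"),
                    levels = c("Democrat", "Republican"))

# Hypothetical helper: accuracy and precision for one positive class.
classification_metrics <- function(actual, predicted, positive) {
  cm <- table(Predicted = predicted, Actual = actual)  # confusion matrix
  tp <- cm[positive, positive]      # predicted positive and truly positive
  fp <- sum(cm[positive, ]) - tp    # predicted positive but truly negative
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = tp / (tp + fp))
}

metrics <- classification_metrics(actual, predicted, positive = "Democrat")
print(metrics)
```

Accuracy counts every correct prediction, while precision only looks at the counties predicted as the positive class, so the two can diverge when the classes are imbalanced.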
- data - contains the three data sets used in the analysis (taken from Kaggle; see the credits):
a. county_facts.csv - Demographic breakdown of each county.
b. county_facts_dictionary.csv - Dictionary to decode variable names in county_facts.csv.
c. pres16results.csv - Results of the 2016 election by county.
- images - contains visualizations:
a. decision_tree.png - Decision tree created from modelling process.
b. model_comparison.png - Comparison of 3 classification models used.
c. population_trends.png - Population size by voting preference.
d. voting_trends.png - Voting trends by top 5 normalized demographics.
e. democrat_arules.png - Scatterplot of democratic association rules by support and confidence.
f. republican_arules.png - Scatterplot of republican association rules by support and confidence.
g. democrats_grid.png - Color grid of democratic association rules.
h. republican_grid.png - Color grid of republican association rules.
- classification - contains classification files that predict election outcomes based on demographics:
a. classification.Rmd - R Markdown detailing the classification process, from data cleaning to model creation.
b. classification.pdf - PDF that shows the R code and its output, for easy viewing.
- association_rules - contains association rules files:
a. association_rules.Rmd - R Markdown to mine rules that relate to demographics and voting preference.
b. association_rules.pdf - PDF that shows the R code and its output, for easy viewing.
- results.pdf - A full write-up comparing classification and association rules mining in R vs SAS.
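The association rule scatterplots listed above are described in terms of support and confidence. As a minimal sketch of what those two measures mean, the base R snippet below computes them by hand for a single rule on a toy table. The column names `high_income` and `voted_democrat` and all values are invented for illustration; the project lists the `arules` package among its dependencies, which would normally do this work.

```r
# Toy county-level flags, invented for illustration (not project data).
counties <- data.frame(
  high_income    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  voted_democrat = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Rule under study: {high_income} => {voted_democrat}
lhs  <- counties$high_income
both <- counties$high_income & counties$voted_democrat

support    <- mean(both)            # share of all rows matching both sides
confidence <- sum(both) / sum(lhs)  # share of LHS rows that also match the RHS

print(c(support = support, confidence = confidence))
```

Support measures how common the full pattern is across all counties, and confidence measures how often the right-hand side holds given the left-hand side.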
All models are built in R; the write-up compares the R results with those from SAS.
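To give a feel for what one of the classifiers looks like in R, here is a minimal K-Nearest Neighbors sketch using `knn()` from the `class` package. It runs on synthetic data: the feature names `pct_college` and `pop_density`, the 70/30 split, and `k = 5` are all assumptions chosen for illustration and do not reflect the project's actual features or settings.

```r
library(class)  # provides knn(); a recommended package bundled with R

set.seed(42)  # reproducible toy data

# Two synthetic, already-normalized features per county (invented values).
n <- 100
features <- data.frame(
  pct_college = c(rnorm(n / 2, mean = 0.7, sd = 0.1),
                  rnorm(n / 2, mean = 0.3, sd = 0.1)),
  pop_density = c(rnorm(n / 2, mean = 0.6, sd = 0.1),
                  rnorm(n / 2, mean = 0.2, sd = 0.1))
)
winner <- factor(rep(c("Democrat", "Republican"), each = n / 2))

# Hold out 30 rows for testing; train on the remaining 70.
test_idx <- sample(seq_len(n), size = 30)
pred <- knn(train = features[-test_idx, ],
            test  = features[test_idx, ],
            cl    = winner[-test_idx],
            k     = 5)

# Share of held-out counties whose label was predicted correctly.
accuracy <- mean(pred == winner[test_idx])
print(accuracy)
```

Because KNN is distance-based, features on very different scales would need normalizing first; the toy features here are generated on a comparable scale to keep the sketch short.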
The following packages are used:
    # list of packages used
    packages <- c("dplyr", "tidyr", "ggplot2", "class", "rpart", "rpart.plot",
                  "neuralnet", "arules", "plyr", "mltools", "arulesViz",
                  "plotly", "RCurl")

    # load each package, installing it first if it is not already installed
    for (p in packages) {
      if (!require(p, character.only = TRUE)) {
        install.packages(p)
        library(p, character.only = TRUE)
      }
    }
- Thanks to Ben Hammer for the county_facts.csv and county_facts_dictionary.csv datasets, which were taken from Kaggle.
- Thanks to Steve Palley for the pres16results.csv dataset, which was taken from Kaggle.
MIT License. Copyright (c) 2019 Ian Jeffries