---
title: "What Makes a Happy Country?"
author: "Abby Newbury, Shivani Das, Doug Schwartz, and Cam Bailey"
date: "12/8/2020"
output:
  html_document:
    toc: TRUE
    theme: spacelab
    toc_float: TRUE
    toc_collapsed: TRUE
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE, echo = FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, error = FALSE, message = FALSE, cache = FALSE)
```
## I. Background and Data Processing
As the World Happiness Report’s [website](https://worldhappiness.report/ed/2020/) states, “The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The World Happiness Report 2020 for the first time ranks cities around the world by their subjective well-being and digs more deeply into how the social, urban and natural environments combine to affect our happiness.”
The World Happiness Report is a publication of the Sustainable Development Solutions Network. The happiness scores and rankings use data from the Gallup World Poll, and the scores are based on answers to the main life evaluation question asked in the poll. The report was written by independent experts and does not necessarily reflect the views of the United Nations. Each of the 156 rows in the data represents a different country.
### I.A Research Question and Background{.tabset}
#### I.A.1 Research Question
The research question that we would like to examine is: what are the crucial determinants of happiness in a country? To answer it, we plan to explore the 26 variables recorded for each country: the country’s name; year; Life Ladder; Log GDP per capita; social support; healthy life expectancy at birth; freedom to make life choices; generosity; perceptions of corruption; positive affect; negative affect; confidence in national government; democratic quality; delivery quality; standard deviation of ladder by country-year; standard deviation/mean of ladder by country-year; GINI index (World Bank estimate); GINI index (World Bank estimate), average 2000-2017 unbalanced panel; GINI of household income reported in Gallup, by wp5-year; Most people can be trusted; and the six “Most people can be trusted” WVS rounds (1981-1984, 1989-1993, 1994-98, 1999-2004, 2005-2009, and 2010-2014). These variables represent estimates of the extent to which each factor contributes to a country’s happiness score.
#### I.A.2 Relevant Literature
Additionally, we wanted to examine previous literature that explores country happiness. For instance, this [website](http://www.gnhcentrebhutan.org/what-is-gnh/gnh-happiness-index/) about the Gross National Happiness (GNH) Index used in Bhutan describes how the GNH Index takes a holistic approach to measuring the happiness and wellbeing of the Bhutanese population. The GNH Index is a measurement tool used in policy making to increase GNH. It includes nine domains, which are further supported by 33 indicators. The Index analyzes the nation’s wellbeing through each person’s achievements on each indicator. In addition to measuring the happiness and wellbeing of the people, it also guides how policies may be designed to create enabling conditions in the areas where survey results are weakest.
The New York Times wrote an interesting article about the results of the 2020 World Happiness Report with special consideration of the ongoing COVID-19 pandemic. According to John F. Helliwell, an editor of the annual happiness report, happiness isn’t a function of how well positive emotions are expressed; rather, it’s a measure of general satisfaction with life and confidence in living a secure life. Happy people “wouldn’t have the highest smile factor,” he said. “They do trust each other and care about each other, and that’s what fundamentally makes for a better life.” - [NYT](https://www.nytimes.com/2020/03/20/world/europe/world-happiness-report.html).
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
# packages for data import, wrangling, modeling, and visualization
library(rio)
library(plyr)
library(dplyr)
library(tidyverse)
library(rpart)
library(psych)
library(pROC)
library(rpart.plot)
library(rattle)
library(caret)
library(knitr)
library(tibble)
library(tidytext)
library(XLConnect)
library(countrycode)
library(zoo)
library(gtools)
library(NbClust)
library(e1071)
library(class)
library(plotly)
library(ggplot2)
```
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
#Read in and combine all sheets of workbook
happy_data_all <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = 1)
happy_data_all <- happy_data_all %>%
mutate(country_year = paste(Country, as.character(year), sep = "_"))
happy_index_all <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = 2)
# 2017 data before cleaning
data.2017.preclean <- happy_data_all[which(happy_data_all$year=="2017"),]
# number of participating countries in each year
for (yr in 2004:2019) {
  assign(paste0("country_count.", yr), sum(happy_data_all$year == yr))
}
# merge a given worksheet into the combined happiness index, matching on Country
happy_reader <- function(num_x) {
  new_sheet <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = num_x)
  merge(happy_index_all, new_sheet, by = "Country", all = TRUE)
}
# sheets 3 through 7 hold the remaining years of index data
for (sheet_num in 3:7) {
  happy_index_all <- happy_reader(sheet_num)
}
happy_index_all <- happy_index_all %>%
gather(key = yr, value = HI, -Country) %>%
mutate(year = as.integer(gsub("HH_", "", yr, fixed = TRUE)) -1) %>%
mutate(country_year = paste(Country, as.character(year), sep = "_")) %>%
select(-yr)
happy_data_all <-merge(happy_data_all, happy_index_all, by.x = "country_year",
by.y = "country_year", all.x = FALSE, all.y = TRUE)
happy_data_all <- happy_data_all %>%
separate(country_year, into = c('Country','year'), sep="_") %>%
select(-Country.x, -Country.y, -year.x, -year.y) %>%
mutate(year = as.integer(year))
#Only keep independent variables that are at least 70% populated
colSums(is.na(happy_data_all)) / nrow(happy_data_all)
# removing variables less than 70% populated
happy_data <- happy_data_all %>% select(HI, Country, year, Life.Ladder, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
# removing rows where HI is NA
happy_data <- happy_data[complete.cases(happy_data[ , 1]),]
happy_data <- happy_data[complete.cases(happy_data),]
happy_data$quartile <- quantcut(happy_data$HI, q=4, na.rm=TRUE)
happy_data$HIQ <- factor(happy_data$quartile,labels = c("1", "2", "3", "4"))
# 2017 data after cleaning
data.2017.cleaned <- happy_data_all[which(happy_data_all$year=="2017"),]
```
## II. Exploratory Data Analysis
### II.A Initial Summary Statistics
#### II.A.1 Count of Countries per Year
Before cleaning the data, the number of countries listed in this report varies year by year from 2005 to 2019: 2005 has the fewest participating countries, with 27, and 2017 the most, with 147. After removing variables that are less than 70% populated and restricting to the years 2014-2019, 117 countries are left for analysis in each year.
```{r, echo=FALSE}
country_count <- data.frame("Year" = c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014", "2015","2016","2017","2018", "2019"), "Count_of_Countries" = c(country_count.2005, country_count.2006, country_count.2007, country_count.2008, country_count.2009, country_count.2010, country_count.2011, country_count.2012, country_count.2013, country_count.2014, country_count.2015, country_count.2016, country_count.2017, country_count.2018, country_count.2019))
country_count.table <- kable(country_count, format = "simple", caption = "Count of Countries by Year")
country_count.table
```
#### II.A.2 Distribution of Happiness Values
One reason we are able to move forward with our method of cleaning the data (explained in II.B) is that the remaining 117 countries are a representative sample of the population. This can be seen in the Happiness Index distributions in the histograms below, before and after cleaning.
```{r}
par(mfrow=c(2,1))
hist.2017.preclean <- hist(data.2017.preclean$Life.Ladder, main = "Histogram of Country Happiness in 2017: Before Cleaning", xlab = "Happiness Index", col = "darkmagenta")
hist.2017.cleaned <- hist(data.2017.cleaned$Life.Ladder, main = "Histogram of Country Happiness in 2017: After Cleaning", xlab = "Happiness Index", col = "blue")
```
#### II.A.3 Average Happiness Over Time
```{r, echo=FALSE}
hi.2014 <- happy_index_all[which(happy_index_all$year=="2014"),]
hi.2015 <- happy_index_all[which(happy_index_all$year=="2015"),]
hi.2016 <- happy_index_all[which(happy_index_all$year=="2016"),]
hi.2017 <- happy_index_all[which(happy_index_all$year=="2017"),]
hi.2018 <- happy_index_all[which(happy_index_all$year=="2018"),]
hi.2019 <- happy_index_all[which(happy_index_all$year=="2019"),]
# HI series (column 2) for a single named country
country_hi <- function(nm) happy_index_all[which(happy_index_all$Country == nm), 2]
avg_HI <- data.frame(
  "Year" = 2014:2019,
  "Average" = c(mean(hi.2014$HI, na.rm = TRUE), mean(hi.2015$HI, na.rm = TRUE),
                mean(hi.2016$HI, na.rm = TRUE), mean(hi.2017$HI, na.rm = TRUE),
                mean(hi.2018$HI, na.rm = TRUE), mean(hi.2019$HI, na.rm = TRUE)),
  "Afghanistan" = country_hi("Afghanistan"),
  "Brazil" = country_hi("Brazil"),
  "China" = country_hi("China"),
  "Germany" = country_hi("Germany"),
  "Jamaica" = country_hi("Jamaica"),
  "Libya" = country_hi("Libya"),
  "New_Zealand" = country_hi("New Zealand"),
  "Philippines" = country_hi("Philippines"),
  "Somalia" = country_hi("Somalia"),
  "United_States" = country_hi("United States")
)
HI_over_time <- plot_ly(avg_HI, x = ~Year, y = ~Average, name = 'Average',
                        type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Afghanistan, name = 'Afghanistan', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Brazil, name = 'Brazil', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~China, name = 'China', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Germany, name = 'Germany', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Jamaica, name = 'Jamaica', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Libya, name = 'Libya', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~New_Zealand, name = 'New Zealand', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Philippines, name = 'Philippines', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Somalia, name = 'Somalia', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~United_States, name = 'United States', type = 'scatter', mode = 'lines')
x_axis <- seq(2014, 2019, by = 1)
HI_over_time <- HI_over_time %>%
layout(
title = "Happiness Index over Time",
xaxis = list(title="Year"),
yaxis = list(title="Happiness Index")
)
HI_over_time
```
### II.B Justify Data Processing Decisions
We collected all six years of Happiness Index scores from the 2015 to 2020 reports and assigned each outcome value to the year of the underlying data from which it was determined in the original dataset. For processing and cleaning the data, we first read in and combined all sheets from the workbook. Next, we kept only the independent variables that were at least 70% populated and removed the rest. Finally, we removed rows where the Happiness Index (HI) was missing and dropped any observations that did not have all variables populated, since we did not want to impute missing values for the decision tree models. After cleaning and processing, we were left with 615 of the original 1,026 observations.
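The two filters described above can be sketched on a toy data frame; this is an illustrative example only, with made-up columns, applying the same logic that the processing chunk applies to the real data.

```r
# Sketch of the filtering steps: keep columns >= 70% populated,
# then drop rows with any remaining missing values (no imputation).
df <- data.frame(
  a = c(1, 2, NA, 4, 5),   # 80% populated -> kept
  b = c(NA, NA, NA, 4, 5), # 40% populated -> dropped
  c = c(1, 2, 3, 4, 5)     # fully populated -> kept
)

# fraction of non-missing values in each column
pct_populated <- colMeans(!is.na(df))
df <- df[, pct_populated >= 0.70]

# keep only complete rows
df <- df[complete.cases(df), ]
```

After both steps the toy frame retains columns `a` and `c` and four of the five rows.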
#### II.B.1 Data Characteristics
The data is a sample, as not all of the world's countries are present in the data set. A poll was implemented to gather data on the variables of interest; this data from the Gallup World Poll was used to determine the influence of each factor in calculating the happiness score and rank for each country. While the data set does not mention how the specific countries were selected, we can see that there are observations for all of the larger and more prominent countries of the world. The countries that tend to be missing are the smaller ones, where polling may simply not have been conducted or was not deemed suitable. Looking at the data set, the qualitative variables are the name and region of each country; all of the other variables are quantitative.
#### II.B.2 Data Issues
There are a couple of issues with the data set. The first is that it only has 157 countries, while there are 195 countries recognized by the United Nations. This would affect the statistical calculations and graphics: statistics such as the mean and median would most likely change if all countries were represented, and the graphs would change with more countries present. The countries with the lowest happiness scores could change, the boxplots representing each region could change, and trends would be seen more thoroughly if all of the countries were present. Another potential issue is that polls were conducted to determine the values for each of the factors with respect to the happiness score. We don’t know how these surveys were conducted in each country, whether respondents took them seriously, whether polling was consistent across countries, or whether the answers are entirely representative of each country’s population. Any of these could lead to inaccurate representation in the data.
#### II.B.3 Correlation Matrix
We've created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, Log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix tells us how each variable is correlated with every other quantitative factor and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder. Similarly, perceptions of corruption are negatively correlated with the happiness index, as expected.
```{r}
#Correlation Matrix
happy <- happy_data
happy.numeric <- happy[,sapply(happy,is.numeric)]
matrix <- cor(happy.numeric, method="pearson")
knitr::kable(round(matrix,2))
```
## III. Clustering Analysis
In analyzing our data, we have defined the variable of interest to be the quartile in which the happiness index falls for a given country and year. Accordingly, the base rate would be 25% of our observations for each quartile, meaning the probability that we assign a country-year observation to the correct quartile by random chance is 0.25, by construction. In this section, we instead try to cluster the data using our explanatory variables to group together countries with similar characteristics. Using both the elbow method and the results of the NbClust function, we’ll determine the optimal number of clusters into which the data will be grouped. Once countries with similar characteristics have been sorted into this optimal number of groups, we will examine the happiness index scores associated with each grouping, with the expectation that countries within a grouping would likely have similar happiness index scores.
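As a minimal sketch of the quartile construction behind this 25% base rate: the report uses `gtools::quantcut`, but base R's `cut()` on the sample quantiles gives an equivalent split, shown here on simulated scores rather than the real HI values.

```r
# Simulated stand-in happiness scores
set.seed(1)
hi <- runif(1000, min = 2, max = 8)

# Bin scores at the empirical quartiles, mirroring quantcut(hi, q = 4)
quartile <- cut(hi,
                breaks = quantile(hi, probs = 0:4 / 4),
                include.lowest = TRUE,
                labels = c("1", "2", "3", "4"))

# By construction each bin holds ~25% of the observations
prop.table(table(quartile))
```

Because the bins are defined by the sample's own quartiles, random guessing assigns an observation to the correct bin with probability 0.25.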
### III.A. Clustering k-means
In order to apply the k-means algorithm, the data required a few modifications. First, we removed the variable of interest from the data (both the factorized quartile and raw happiness index measure). The goal of this exercise is to group together countries with similar characteristics under the hypothesis that these similarities would imply similar measures of happiness. Accordingly, Country was also removed as we don't want that factor variable to provide explanatory power (lest our advice for an unhappy country be to try not being that country). The other factor variable in our dataset, the year of the observation, does seem relevant as global events in a particular year can certainly explain shifts in a country’s happiness, e.g. a global pandemic. Rather than utilizing dummy coding as we would for nominal factor variables, we instead allowed year to be treated as a numeric variable and applied the same standardization as the rest of the variables. Having created a final data set consisting only of numeric variables, we then standardized the entire dataset.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
happy_data_num <- happy_data[,sapply(happy_data, is.numeric)]
happy_data_num <- happy_data_num %>%
select(-HI, -year)
happy_data_num <- as.data.frame(scale(happy_data_num))
```
#### III.A.1 Optimal Number of Clusters{.tabset}
Applying the k-means function to assign observations to as few as one or up to ten different clusters, we seek to identify the number of clusters that will maximize the inter-cluster variance (i.e., the sum of the distances between points from different clusters) subject to the constraint of minimizing the intra-cluster variance (the sum of the distances between points within the same cluster).
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
explained_variance = function(data_in, k){
set.seed(1)
kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 50)
var_exp = kmeans_obj$betweenss / kmeans_obj$totss
}
explained_var_happy = sapply(1:10, explained_variance, data_in = happy_data_num)
elbow_data_happy = data.frame(k = 1:10, explained_var_happy)
```
##### III.A.1.1 Elbow Method
The elbow graph shown below plots the number of clusters used against the fraction of the total variance accounted for by the designated number of clusters. The latter refers to the ratio of inter-cluster variance relative to the total variance in the data (i.e., the sum of the distances between all the points in the data set). The ‘elbow’, the point beyond which additional clusters yield only small gains in inter-cluster variance relative to the reduction in intra-cluster variance, appears to fall at around 3 clusters.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(elbow_data_happy,
aes(x = k,
y = explained_var_happy)) +
geom_point(size = 4) +
geom_line(size = 1) +
xlab('Number of Clusters') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
```
##### III.A.1.2 NbClust Majority Rule
Using the NbClust function we find that the optimal number of clusters is 3; as shown on the graph below, the majority of models (12) voted for 3 clusters. This reaffirms our conclusion drawn from our initial inspection of the elbow chart.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
nbclust_happy = NbClust(data = happy_data_num, method = "kmeans")
freq_k_happy = nbclust_happy$Best.nc[1,]
freq_k_happy = data.frame(freq_k_happy)
```
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(freq_k_happy,
aes(x = freq_k_happy)) +
geom_bar() +
scale_x_continuous(breaks = seq(0, max(freq_k_happy), by = 1)) +
scale_y_continuous(breaks = seq(0, 12, by = 1)) +
labs(x = "Number of Clusters",
y = "Number of Votes",
title = "Cluster Analysis")
```
#### III.A.2 Assigning Optimal Number of Clusters{.tabset}
Using the recommended number of clusters, we find that 3 clusters explain 44% of the total variance. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model does extremely well at assigning countries to clusters reflecting overall happiness.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
kmeans_obj_happy = kmeans(happy_data_num, centers = 3, algorithm = "Lloyd", iter.max = 50)
(var_exp_k_happy = kmeans_obj_happy$betweenss / kmeans_obj_happy$totss)
Final_Happy_Clusters <- happy_data
Final_Happy_Clusters$Cluster <- as.factor(kmeans_obj_happy$cluster)
```
As the graphs below illustrate, our clusters do quite well at identifying countries on the lower and higher end of the happiness index range. However, as mentioned earlier, the self-reported happiness variable (Life Ladder) appears to be too tightly correlated with our dependent variable. In the following subsection we will explore what the results of our clustering analysis would be without this explanatory variable.
##### Life Ladder
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Life.Ladder,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Life Ladder and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Life Ladder") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Log.GDP.per.capita,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Log GDP per Capita and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### Life Expectancy
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Healthy.life.expectancy.at.birth ,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Life Expectancy and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Healthy Life Expectancy at Birth") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
### III.B. Revised Clustering Analysis
As discussed in the section above, the self-reported happiness variable (Life Ladder) is very highly correlated with the happiness index, our dependent variable. In this section we remove the variable from our data and redo our clustering analysis. The goal of this analysis is to identify countries with happiness index scores that diverge from what our expectations would be based upon all the other factors (the growth in GDP, trust in government, life expectancy, etc.).
#### III.B.1 Removing Life Ladder{.tabset}
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
happy_data_num_b <- happy_data_num %>% select(-Life.Ladder)
explained_var_happy_b = sapply(1:10, explained_variance, data_in = happy_data_num_b)
elbow_data_happy_b = data.frame(k = 1:10, explained_var_happy_b)
```
As before, we now re-apply the k-means function to assign observations to as few as one or up to ten different clusters, seeking to identify the number of clusters that will maximize the inter-cluster variance and minimize the intra-cluster variance.
##### III.B.1.1 Elbow Method
The elbow graph shown below plots the number of clusters used against the fraction of the total variance accounted for by the designated number of clusters. The ‘elbow’ still appears to fall at around 3 clusters.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(elbow_data_happy_b,
aes(x = k,
y = explained_var_happy_b)) +
geom_point(size = 4) +
geom_line(size = 1) +
xlab('Number of Clusters') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
```
##### III.B.1.2 NbClust Majority Rule
Using the NbClust function we find that the optimal number of clusters is 3; as shown on the graph below, the majority of models (12) voted for 3 clusters. This reaffirms our conclusion drawn from our initial inspection of the elbow chart and is consistent with our findings prior to removing Life Ladder from our data.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
nbclust_happy_b = NbClust(data = happy_data_num_b, method = "kmeans")
freq_k_happy_b = nbclust_happy_b$Best.nc[1,]
freq_k_happy_b = data.frame(freq_k_happy_b)
```
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(freq_k_happy_b,
aes(x = freq_k_happy_b)) +
geom_bar() +
scale_x_continuous(breaks = seq(0, max(freq_k_happy_b), by = 1)) +
scale_y_continuous(breaks = seq(0, 12, by = 1)) +
labs(x = "Number of Clusters",
y = "Number of Votes",
title = "Cluster Analysis")
```
#### III.B.2 Assigning Optimal Number of Clusters{.tabset}
Using the recommended number of clusters on the revised data set, we find that 3 clusters explain 42.6% of the variance, only slightly less than the 44% accounted for when applying the same number of clusters with the variable Life Ladder included. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model still does quite well at assigning countries to clusters reflecting overall happiness.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
kmeans_obj_happy_b = kmeans(happy_data_num_b, centers = 3, algorithm = "Lloyd", iter.max = 50)
(var_exp_k_happy_b = kmeans_obj_happy_b$betweenss / kmeans_obj_happy_b$totss)
Final_Happy_Clusters$Cluster_b <- as.factor(kmeans_obj_happy_b$cluster)
```
As before, we now plot our revised clusters against the observed happiness index and variables of interest. Note that even without training the clusters using the self-reported variable Life Ladder, our clusters still do well at identifying countries on the high and low end of the happiness index spectrum.
##### Life Ladder
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Life.Ladder,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Life Ladder and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Life Ladder") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Log.GDP.per.capita,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Log GDP per Capita and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### Life Expectancy
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Healthy.life.expectancy.at.birth ,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Life Expectancy and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Healthy Life Expectancy at Birth") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
### III.C Evaluating Clusters
Having chosen the optimal number of clusters and set of explanatory variables, we can assess the results of the k-means clustering analysis using two approaches. First, we will compare the distribution of our clusters against the initially designated quartiles of the happiness index distribution. Then we will examine a series of visualizations of the data to glean insights into our clusters.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
Final_Happy_Clusters$Cl_b <- factor(Final_Happy_Clusters$Cluster_b, labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"))
comp_clust_HI = table(as_factor(Final_Happy_Clusters$Cl_b), Final_Happy_Clusters$HIQ)
```
#### III.C.1 Pseudo-Confusion Matrix
The following table compares the actual happiness quartile assignments and the cluster assigned by our model. Note that no countries falling in the lowest quartile of the HI were assigned to the cluster associated with happier countries, and vice versa. The majority of observations in the happier and unhappier clusters are concentrated in the 1st and 4th quartiles, respectively. This is a good indication that there are strong similarities in the characteristics of the happiest and least happy countries. Note that the cluster spanning all four quartiles of the happiness index does not necessarily indicate a good or bad fit; rather, it reflects that when quartiles are cut from a continuous distribution, there may be little difference between countries whose happiness indices fall on either side of a quartile threshold.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
kable(comp_clust_HI)
```
#### III.C.2 Visualizing Final Clusters{.tabset}
Having seen that our clusters fit well when plotted against the happiness index itself, we can begin to explore how particular variables influenced a given cluster assignment by plotting explanatory variables and contrasting the assigned cluster and observed happiness index. The graphs below plot pairs of explanatory variables against our predicted happiness clustering (the shape) and the actual assigned happiness index (color scale).
##### Social Support
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Healthy.life.expectancy.at.birth,
y = Social.support,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Life Expectancy & Social Support by Happiness Index (HI) and Cluster") +
xlab("Healthy Life Expectancy at Birth") +
ylab("Social Support") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Healthy.life.expectancy.at.birth,
y = Log.GDP.per.capita,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Life Expectancy & GDP by Happiness Index (HI) and Cluster") +
xlab("Healthy Life Expectancy at Birth") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Democratic Quality
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Delivery.Quality,
y = Democratic.Quality,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Delivery & Democratic Quality by Happiness Index (HI) and Cluster") +
xlab("Delivery Quality") +
ylab("Democratic Quality") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Perceived Corruption
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Perceptions.of.corruption,
y = Confidence.in.national.government,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Corruption and Govt Confidence by Happiness Index (HI) and Cluster") +
xlab("Perceptions of Corruption") +
ylab("Confidence in National Government") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Generosity
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Generosity,
y = Freedom.to.make.life.choices,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Generosity and Freedom by Happiness Index (HI) and Cluster") +
xlab("Generosity") +
ylab("Freedom to Make Life Choices") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
### III.D Conclusion of k-means
In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.
## Decision Tree
```{r, include=FALSE}
library(rio)
library(plyr)
library(dplyr)
library(tidyverse)
library(rpart)
library(psych)
library(pROC)
#install.packages("rpart.plot")
library(rpart.plot)
#install.packages("rattle")
library(rattle)
library(caret)
library(knitr)
library(tibble)
```
The purpose of this decision tree is to classify each country into happiness quartiles based on variables such as life ladder, social support, and democratic quality. The decision tree model will first be built using default settings, and then the threshold will be adjusted to optimize for both the highest and lowest quartiles, allowing us to glean insights into which factors contribute the most to countries' happiness.
### Methods
The Happiness Index was changed into quartiles, where the quartile distribution is as follows:
```{r, echo=FALSE, eval=TRUE}
quantile(happy_data$HI)
happy_data1 <- happy_data %>% select(HIQ,year, Life.Ladder, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
```
Any rows with NA were removed from the data frame in order to perform the decision tree analysis.
##### Base rate calculation
The base rate for this classifier is the individual percentages for each quartile. Quartile 1 has a base rate of 25.04%, Quartile 2: 25.04%, Quartile 3: 24.9%, and Quartile 4: 25.04%. This base rate is as expected when distributing data into quartiles.
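These base rates can be reproduced with a one-line frequency table; a minimal sketch on an illustrative quartile vector (`hiq` here stands in for `happy_data$HIQ`):

```r
# Base rate per class = class frequency / total observations
hiq <- factor(c(1, 1, 2, 2, 3, 4, 4, 4))  # illustrative quartile labels
base_rates <- prop.table(table(hiq))      # proportions summing to 1
round(base_rates, 3)
```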
``` {r, include=FALSE}
#6 For the multi-class this will be the individual percentages for each class.
library(dplyr)
dplyr::summarize
detach("package:plyr", unload=TRUE)
table(happy_data$HIQ)
HIQ_BR <- happy_data %>%
group_by(HIQ) %>%
summarize(hiq_cnt=n())
HIQ_BR$hiq_br <- (HIQ_BR$hiq_cnt / length(happy_data$HIQ))
(h_br <- (HIQ_BR$hiq_cnt / length(happy_data$HIQ)))
HIQ_BR
```
#### Build model using default settings
``` {r, include=FALSE}
#7 Build your model using the default settings
happy_data1 <- as.data.frame(happy_data1)
set.seed(1980)
happy_data1_gini_t = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data1)
```
The most important variable for the tree is Life.Ladder. The first split in the tree is created using this variable, as seen below, with one branch of life ladder less than 5.5 and the other of life ladder greater than or equal to 5.5. Life ladder asks people to rate their own lives on a 0 to 10 scale, with 10 being the best possible life. Thus, it seems that the most important variable for a country's happiness is how its people rate their lives, or their perception of how good their life is.
Life ladder is the only variable used in this classifier, and as seen below, its split points line up almost perfectly with the quartile distribution above, the only discrepancy being a life ladder cutoff of 5.5 as opposed to a quartile break of 5.3.
```{r, include=FALSE}
#8 View the results, what is the most important variable for the tree?
happy_data1_gini_t
#View(happy_data_gini_t$frame)
happy_data1_gini_t$variable.importance
# Life.Ladder is most important
```
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
#9 Plot the tree using the rpart.plot package
rpart.plot(happy_data1_gini_t, type =4, extra = 101)
```
##### Optimal number of splits
The rel error column is the relative error of the tree's predictions on the training data; xerror is the cross-validated error; and xstd is the standard deviation of the cross-validated errors. These lead to the selection rule: choose the split at the lowest level where rel error + xstd < xerror.
The graph below plots the cross-validated relative error (xerror) on the y-axis against complexity on the x-axis. From our calculations, xerror exceeds opt after the 3rd split, where xerror is 0.2234 and opt is 0.21314. In the graph, the threshold appears to be crossed after the third split, which would indicate an optimal tree at the fourth level. Since the plot and the table comparing opt and xerror do not agree exactly, we take 4 splits as the optimal amount because it lines up with the quartile designation. The optimal cp, i.e. the cp at four splits, is 0.01, as seen in the table below.
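The selection rule can be applied mechanically to the cp table; a self-contained sketch on an illustrative table (values invented for demonstration, mirroring rpart's cptable columns):

```r
# Toy cp table with rpart's column names (values are illustrative)
cptable <- data.frame(
  CP          = c(0.50, 0.20, 0.05, 0.01),
  nsplit      = c(0, 1, 2, 3),
  `rel error` = c(1.00, 0.50, 0.30, 0.20),
  xerror      = c(1.02, 0.52, 0.35, 0.30),
  xstd        = c(0.05, 0.04, 0.03, 0.02),
  check.names = FALSE
)
cptable$opt <- cptable$`rel error` + cptable$xstd           # rel error + xstd bound
first_crossing <- min(which(cptable$xerror > cptable$opt))  # first row where xerror exceeds opt
```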
```{r, echo=FALSE, warning=FALSE, message=FALSE}
#10 plot and convert the cp table to a data.frame
#HIQ Version
plotcp(happy_data1_gini_t)
```
```{r, echo=FALSE, eval=TRUE}
#11 Add together the real error and standard error to create a new column and determine the optimal number of splits.
cptable_t <- as.data.frame(happy_data1_gini_t$cptable)
cptable_t$opt <- cptable_t$`rel error`+ cptable_t$xstd
kable(cptable_t)
#View(cptable_t)
# At 3 splits xerror (0.2234) exceeds opt (0.21314)
```
#### Model Evaluation
```{r, include=FALSE}
#12 Use the predict function and your model to predict the target variable.
happy_data1_fitted_t = predict(happy_data1_gini_t, type= "class")
#View(as.data.frame(happy_data_fitted_t))
```
##### Confusion Matrix
```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, error=FALSE}
# creating confusion matrix to compare actual and predicted values
hiq_conf_matrix = table(happy_data1_fitted_t, happy_data1$HIQ)
hiq_conf_matrix
```
The confusion matrix above compares the predicted quartile values (Q1-4) generated by the model to the actual quartiles. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 85.5% of observations. Our model only ever misclassifies by one quartile up or down (e.g., quartile 2 may be misclassified as 1 or 3, but never as 4).
##### Hit and Detection Rate
The error rate is defined as the number of quartile misclassifications (e.g. predicting Q1 when the true class is Q4) divided by the total number of data points. The error rate for this model is 14.5%, which is fairly low. The hit rate is the proportion of predictions that were correctly identified, or 1 - error rate = 85.5%.
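Both rates fall directly out of the diagonal of the confusion matrix; a self-contained sketch on an illustrative 2x2 matrix:

```r
# Toy confusion matrix: rows = predicted class, columns = actual class
cm <- matrix(c(40,  5,
               10, 45), nrow = 2, byrow = TRUE)
hit_rate   <- sum(diag(cm)) / sum(cm)  # correctly classified share
error_rate <- 1 - hit_rate             # misclassified share
```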
```{r, echo=FALSE, include=FALSE}
# Diagonal of matrix (correctly identified observations)
sum(hiq_conf_matrix[row(hiq_conf_matrix)== col(hiq_conf_matrix)])
hiq_accur_rate = sum(hiq_conf_matrix[row(hiq_conf_matrix)== col(hiq_conf_matrix)]) / sum(hiq_conf_matrix)
paste0("Correctly Identifying:", hiq_accur_rate * 100, "%")
hiq_error_rate = sum(hiq_conf_matrix[row(hiq_conf_matrix)!= col(hiq_conf_matrix)]) / sum(hiq_conf_matrix)
paste0("Error Rate:", hiq_error_rate * 100, "%")
```
##### Comparison of Results to Base Rates
```{r, eval=TRUE, echo=FALSE, include=FALSE}
#comparing to four base rates
HIQ_Chk <- happy_data1 %>%
select(HIQ)
HIQ_Chk <- as.data.frame(HIQ_Chk)
HIQ_Chk$Predicted <- happy_data1_fitted_t
HIQ_Chk <- HIQ_Chk %>%
mutate(correct = if_else(HIQ==Predicted, 1, 0),
guessedQ1 = if_else(HIQ!=Predicted & Predicted=="1", 1, 0),
guessedQ2 = if_else(HIQ!=Predicted & Predicted=="2", 1, 0),
guessedQ3 = if_else(HIQ!=Predicted & Predicted=="3", 1, 0),
guessedQ4 = if_else(HIQ!=Predicted & Predicted=="4", 1, 0)) %>%
group_by(HIQ) %>%
summarise(correct=sum(correct), guessedQ1=sum(guessedQ1),
guessedQ2=sum(guessedQ2), guessedQ3=sum(guessedQ3),
guessedQ4=sum(guessedQ4), total=n())
HIQ_Chk <- merge(HIQ_Chk, HIQ_BR)
#View(HIQ_Chk)
# Quick Output:
HIQ_Chk$accuracy <- HIQ_Chk$correct/HIQ_Chk$total
HIQ_Chk_ <- HIQ_Chk %>% select(HIQ, accuracy, hiq_br)
#View(HIQ_Chk_)
kable(HIQ_Chk_)
```
The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted quartile values (Q1-4) generated by the model to the actual happiness quartile. From the confusion matrix it appears that we are correctly identifying 85.5% of observations, with errors coming from the next highest or lowest quartile.
1. For Q1, we've correctly identified 83.11% of observations which is considerably better than our base rate of 25%. Our model misidentified 17 Q2 observations as Q1.
2. For Q2, our model is performing quite well, correctly identifying 83.11% of observations, considerably better than our base rate of 25%. Our model misidentified 26 Q1 observations and 16 Q3 observations as Q2.
3. For Q3, we've correctly identified 83.0% of observations considerably better than our base rate of 25%. The model misidentified 9 Q2 observations and 11 Q4 observations as Q3.
4. For Q4, we've correctly identified 92.9% of observations, considerably better than our base rate of 25%. Our model misidentified 10 Q3 observations as Q4.
##### ROC and AUC Score
The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with an overall relatively high Multi-class Area Under the Curve (AUC): .9527.
There are a few conclusions to glean from this. First, this decision tree model has merit in being significantly better than guessing for some classes; while the model might not be perfect, it is a large step up from using nothing.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, include=FALSE }
#16 Generate a ROC and AUC output, interpret the results
h_roc <- multiclass.roc(happy_data1$HIQ, as.numeric(happy_data1_fitted_t), plot = TRUE)
h_roc$auc
```
##### Metric to optimize
If our primary goal is to identify the highest quartile of happiness, Q4, we would lower the probability threshold for assigning observations to Q4 while trying to preserve our high degree of accuracy for the other quartiles. The results of lowering the probability threshold for Q4 to 0.07 are shown in the confusion matrix below. Note that lowering this threshold means we are now correctly identifying all 154 of the Q4 observations (as intended). However, we are no longer identifying any Q3 observations: at this threshold, all 137 Q3 observations are classified as Q4. Because the threshold must be this low to correctly classify every actual Q4 observation, yet doing so sends every Q3 observation to Q4, the model evidently has a hard time distinguishing between Q3 and Q4.
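Lowering the threshold amounts to overriding the usual highest-probability rule whenever the Q4 probability clears the cutoff. A minimal sketch on an illustrative probability matrix (the `probs` rows are invented; 0.07 is the cutoff from the text):

```r
# Each row holds predicted class probabilities for quartiles 1-4
probs <- data.frame(`1` = c(0.70, 0.40, 0.05),
                    `2` = c(0.20, 0.30, 0.05),
                    `3` = c(0.05, 0.25, 0.10),
                    `4` = c(0.05, 0.05, 0.80),
                    check.names = FALSE)
threshold <- 0.07
# Assign Q4 whenever its probability clears the cutoff; otherwise take the
# highest-probability class among Q1-Q3
pred <- ifelse(probs$`4` >= threshold, "4",
               colnames(probs)[max.col(probs[, 1:3], ties.method = "first")])
```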
```{r, include=FALSE}
#17 Use the predict function to generate percentages, then select several different threshold levels using the confusion matrix function and interpret the results? What metric should we be trying to optimize.
#t_roc1 <- multiclass.roc(happy_data$HIQ), ifelse(happy_data_fitprob_t[,'T1'] >= 0.35, 0, 1), plot = TRUE)
set.seed(1980)
happy_data1_fitprob_t = predict(happy_data1_gini_t, type= "prob")
happy_data1_fitprob_t <- as.data.frame(happy_data1_fitprob_t)
happy_data1_fitprob_t = happy_data1_fitprob_t %>%
mutate(outcome = case_when(`4`>=.07 ~ "4",
`4`<0.3 & `2`>=`1` & `2`>=`3` ~"2",
`4`<0.3 & `3`>=`2` & `3`>=`1` ~"3",
`4`<0.3 & `1`>=`2` & `1`>=`3` ~"1"))
happy_data1_fitprob_t$outcome <- as.factor(happy_data1_fitprob_t$outcome)
table(happy_data1_fitprob_t$outcome, happy_data1$HIQ)
```
```{r, eval=TRUE, echo=FALSE, include=TRUE}
table(happy_data1_fitprob_t$outcome, happy_data1$HIQ)
```
#### Hyperparameter adjustment
Adjusting the complexity parameter (cp) threshold to 0.01 yields a model identical to the first: misclassifications still fall only one quartile above or below, and our accuracy remains the same for each class and thus overall. The optimal cp from earlier was 0.01, and rerunning the decision tree with it yielded identical results, likely because the initial decision tree already had three splits and 0.01 is rpart's default cp.
```{r, echo=FALSE, include=FALSE}
set.seed(1980)
happy_data1_gini_t2 = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data1, control = rpart.control(cp=.01))
#19 Try adjusting several other hyperparameters via rpart.control and review the model evaluation #Change CP Threshold to to much lower cutoff at 0.0625
#<- includes depth zero, the control for additional options (could use CP, 0.01 is the default)
plotcp(happy_data1_gini_t2)
#View(bc_tree_gini_t2$frame)
rpart.plot(happy_data1_gini_t2, type =4, extra = 101)
cptable_t2 <- as.data.frame(happy_data1_gini_t2$cptable)
cptable_t2$opt <- cptable_t2$`rel error`+ cptable_t2$xstd
# The quality of the model is impacted by:
#View(cptable_t)
#View(cptable_t2)
set.seed(1980)
happy_data1_fitted_t2 = predict(happy_data1_gini_t2, type= "class")
table(happy_data1_fitted_t2, happy_data1$HIQ)
table(happy_data1_fitted_t, happy_data1$HIQ)
```
#### Decision Tree Model, no life ladder
Next, we will investigate what the decision tree looks like without the life ladder variable, which functioned as an almost perfect classifier. In order to prevent overfitting, we set minsplit to 93, where minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted. Thus, roughly 15% of the data must fall in a node before a split is attempted.
The most important variable of this tree is healthy life expectancy at birth, and the next most important is log GDP per capita. The decision tree can be seen below and is evidently significantly more complicated than the first tree, which was based on the single variable life ladder.
```{r, include=FALSE}
# selecting for all variables but Life.Ladder
happy_data2 <- happy_data %>% select(HIQ,year,Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
# running tree with default settings
happy_data2 <- as.data.frame(happy_data2)
set.seed(1980)
happy_data2_gini_t = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data2, minsplit=93)
```
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
#9 Plot the tree using the rpart.plot package
rpart.plot(happy_data2_gini_t, type =4, extra = 101)
```
#### Variable Importance
Delving more into variable importance, the table below shows the variable importance value for each variable. Note that healthy life expectancy at birth is the most important variable while generosity is the least important variable in predicting the quartile of happiness a country is in.
```{r, include=TRUE, echo=FALSE, eval=TRUE}
#happy_data2_gini_t
#View(happy_data_gini_t$frame)
kable(happy_data2_gini_t$variable.importance)
```
The graph below depicts this variable importance visually. It is interesting to note that healthy life expectancy at birth and log GDP per capita rank significantly higher than the other variables, with importance values at least 1.6x higher than the next most important variable, delivery quality. Thus, if a country wanted to increase its happiness ranking without taking into account people's perception of their life quality (life ladder), it could focus on public health measures that raise healthy life expectancy and on growing its GDP per capita.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
df <- data.frame(`importance` = happy_data2_gini_t$variable.importance)
df2 <- df %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(`importance`) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(df2) +
geom_col(aes(x = variable, y = `importance`),
col = "black", show.legend = F) +
coord_flip() +
scale_fill_grey() +
theme_bw()
```
#### Model Evaluation
Let's evaluate this model in comparison with the first decision tree model we generated.
```{r, include=FALSE}
#12 Use the predict function and your model to predict the target variable.
happy_data2_fitted_t = predict(happy_data2_gini_t, type= "class")
#View(as.data.frame(happy_data_fitted_t))
```
##### Confusion Matrix
```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, error=FALSE}
# creating confusion matrix to compare actual and predicted values
hiq2_conf_matrix = table(happy_data2_fitted_t, happy_data2$HIQ)
hiq2_conf_matrix
```
The confusion matrix above compares the predicted quartile values (Q1-4) generated by the model to the actual quartiles. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 72.2% of observations, significantly lower than the previous model's 85.5%. Unlike the first model, which only misclassified by one quartile up or down, this model's errors span all quartiles, as seen in the predicted Q3 row.
##### Hit and Detection Rate
The error rate is defined as the number of quartile misclassifications (e.g. predicting Q1 when the true class is Q4) divided by the total number of data points. The error rate for this model is 27.8%, which is fairly high. The hit rate is the proportion of predictions that were correctly identified, or 1 - error rate = 72.2%.
##### Comparison of Results to Base Rates
```{r, eval=TRUE, echo=FALSE, include=FALSE}
#comparing to four base rates
HIQ_Chk2 <- happy_data2 %>%
select(HIQ)
HIQ_Chk2 <- as.data.frame(HIQ_Chk2)
HIQ_Chk2$Predicted <- happy_data2_fitted_t
HIQ_Chk2 <- HIQ_Chk2 %>%
mutate(correct = if_else(HIQ==Predicted, 1, 0),
guessedQ1 = if_else(HIQ!=Predicted & Predicted=="1", 1, 0),
guessedQ2 = if_else(HIQ!=Predicted & Predicted=="2", 1, 0),
guessedQ3 = if_else(HIQ!=Predicted & Predicted=="3", 1, 0),
guessedQ4 = if_else(HIQ!=Predicted & Predicted=="4", 1, 0)) %>%
group_by(HIQ) %>%
summarise(correct=sum(correct), guessedQ1=sum(guessedQ1),
guessedQ2=sum(guessedQ2), guessedQ3=sum(guessedQ3),
guessedQ4=sum(guessedQ4), total=n())
HIQ_Chk2 <- merge(HIQ_Chk2, HIQ_BR)
#View(HIQ_Chk2)
# Quick Output:
HIQ_Chk2$accuracy <- HIQ_Chk2$correct/HIQ_Chk2$total
HIQ_Chk3 <- HIQ_Chk2 %>% select(HIQ, accuracy, hiq_br)
#View(HIQ_Chk3)
kable(HIQ_Chk3)
```
The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted quartile values (Q1-4) generated by the model to the actual happiness quartile. From the confusion matrix it appears that we are correctly identifying 72.2% of observations.
1. For Q1, we've correctly identified 87.3% of observations, considerably better than our base rate of 25% and slightly higher than the first model's Q1 accuracy of 83.11%.
2. For Q2, our model correctly identifies 46.4% of observations, better than our base rate of 25% but well below the first model's Q2 accuracy of 83.11%.
3. For Q3, we've correctly identified 51.8% of observations, better than our base rate of 25% but well below the first model's Q3 accuracy of 83.0%.
4. For Q4, we've correctly identified 96.2% of observations, considerably better than our base rate of 25% and higher than the first model's Q4 accuracy of 92.9%.
##### ROC and AUC Score
The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with an overall relatively high Multi-class Area Under the Curve (AUC): .9015.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, include=FALSE }
#16 Generate a ROC and AUC output, interpret the results
h_roc2 <- multiclass.roc(happy_data2$HIQ, as.numeric(happy_data2_fitted_t), plot = TRUE)
h_roc2$auc
```
### Recommendations
This model has very serious real-world implications. The most important variable of this decision tree was life ladder; when it was removed, the most important variable became healthy life expectancy at birth. It is important to note that optimizing only the variables described in this analysis, as opposed to taking a holistic approach, could harm a country's actual happiness while improving its score, through a mechanism similar to variable optimization under the U.S. News and World Report college rankings. One could argue that receiving a false classification in a lower quartile is less harmful than a false classification in a higher quartile, as the former could push a country (if it pays attention to the scores) to work harder to increase the happiness of its citizens.
## Conclusions
We've created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix tells us how each variable is correlated with every other quantitative factor and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder. Similarly, perceptions of corruption are negatively correlated with the happiness index, as expected.
In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.
The decision tree model further demonstrated that life ladder is the most important variable in determining a country's happiness, with higher life ladder scores conferring a higher happiness quartile. In conclusion, when focusing on improving a country's happiness, special care should be put into ensuring that people perceive their lives positively, as that seems to be the number one determinant.