---
title: "What Makes a Happy Country?"
author: "Abby Newbury, Shivani Das, Doug Schwartz, and Cam Bailey"
date: "12/8/2020"
output:
  html_document:
    toc: TRUE
    theme: spacelab
    toc_float: TRUE
    toc_collapsed: TRUE
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE, echo = FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, error = FALSE, message = FALSE, cache = FALSE)
```
## I. Background and Data Processing
As the World Happiness Report’s [website](https://worldhappiness.report/ed/2020/) states, “The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The World Happiness Report 2020 for the first time ranks cities around the world by their subjective well-being and digs more deeply into how the social, urban and natural environments combine to affect our happiness.”
The World Happiness Report is a publication of the Sustainable Development Solutions Network. The happiness scores and rankings use data from the Gallup World Poll, and the scores are based on answers to the main life evaluation question asked in the poll. The report was written by independent experts and does not necessarily reflect the views of the United Nations. Each of the 156 rows in the data represents a different country.
### I.A Research Question and Background{.tabset}
#### I.A.1 Research Question
The research question that we would like to examine is: what are the crucial determinants of happiness in a country? To answer it, we plan to explore the 26 variables recorded for each country: the country’s name; year; Life Ladder; Log GDP per capita; social support; healthy life expectancy at birth; freedom to make life choices; generosity; perceptions of corruption; positive affect; negative affect; confidence in national government; democratic quality; delivery quality; standard deviation of ladder by country-year; standard deviation/mean of ladder by country-year; GINI index (World Bank estimate); GINI index (World Bank estimate), average 2000-2017 unbalanced panel; GINI of household income reported in Gallup, by wp5-year; Most people can be trusted; and the six “Most people can be trusted” WVS rounds (1981-1984, 1989-1993, 1994-98, 1999-2004, 2005-2009, and 2010-2014). These variables represent estimates of the extent to which each factor contributes to a country’s happiness score.
#### I.A.2 Relevant Literature
Additionally, we wanted to examine previous literature that explores country happiness. For instance, this [website](http://www.gnhcentrebhutan.org/what-is-gnh/gnh-happiness-index/) about the Gross National Happiness (GNH) Index used in Bhutan describes how the GNH Index takes a holistic approach to measuring the happiness and wellbeing of the Bhutanese population. The GNH Index is a measurement tool used in policy making to increase GNH. It includes nine domains, which are further supported by 33 indicators. The Index analyzes the nation’s wellbeing through each person’s achievements on each indicator. In addition to measuring the happiness and wellbeing of the people, it also guides how policies may be designed to create enabling conditions in the areas where survey results are weakest.
The New York Times wrote an interesting article about the results of the 2020 World Happiness Report with special consideration of the ongoing COVID-19 pandemic. According to John F. Helliwell, an editor of the annual happiness report, happiness isn’t a function of how well positive emotions are expressed; rather, it’s a measure of general satisfaction with life and confidence in living a secure life. Happy people “wouldn’t have the highest smile factor,” he said. “They do trust each other and care about each other, and that’s what fundamentally makes for a better life.” - [NYT](https://www.nytimes.com/2020/03/20/world/europe/world-happiness-report.html).
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
# packages for data import, wrangling, modeling, and visualization
library(rio)
library(plyr)
library(dplyr)
library(tidyverse)
library(rpart)
library(psych)
library(pROC)
library(rpart.plot)
library(rattle)
library(caret)
library(knitr)
library(tibble)
library(tidytext)
library(XLConnect)
library(countrycode)
library(zoo)
library(gtools)
library(NbClust)
library(e1071)
library(class)
library(plotly)
library(ggplot2)
```
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
#Read in and combine all sheets of workbook
happy_data_all <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = 1)
happy_data_all <- happy_data_all %>%
mutate(country_year = paste(Country, as.character(year), sep = "_"))
happy_index_all <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = 2)
# 2017 data before cleaning
data.2017.preclean <- happy_data_all[which(happy_data_all$year=="2017"),]
# number of participating countries in each year
for (yr in 2004:2019) {
  assign(paste0("country_count.", yr), sum(happy_data_all$year == yr))
}
# merge a given worksheet into the combined happiness index, matching on Country
happy_reader <- function(num_x) {
  new_sheet <- readWorksheetFromFile("Happiness_Index_Data.xls", sheet = num_x)
  merge(happy_index_all, new_sheet, by = "Country", all = TRUE)
}
# sheets 3 through 7 hold the remaining years of index data
for (sheet_num in 3:7) {
  happy_index_all <- happy_reader(sheet_num)
}
happy_index_all <- happy_index_all %>%
gather(key = yr, value = HI, -Country) %>%
mutate(year = as.integer(gsub("HH_", "", yr, fixed = TRUE)) -1) %>%
mutate(country_year = paste(Country, as.character(year), sep = "_")) %>%
select(-yr)
happy_data_all <-merge(happy_data_all, happy_index_all, by.x = "country_year",
by.y = "country_year", all.x = FALSE, all.y = TRUE)
happy_data_all <- happy_data_all %>%
separate(country_year, into = c('Country','year'), sep="_") %>%
select(-Country.x, -Country.y, -year.x, -year.y) %>%
mutate(year = as.integer(year))
#Only keep independent variables that are at least 70% populated
colSums(is.na(happy_data_all)) / nrow(happy_data_all)
# removing variables less than 70% populated
happy_data <- happy_data_all %>% select(HI, Country, year, Life.Ladder, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
# removing rows where HI is NA
happy_data <- happy_data[complete.cases(happy_data[ , 1]),]
happy_data <- happy_data[complete.cases(happy_data),]
happy_data$quartile <- quantcut(happy_data$HI, q=4, na.rm=TRUE)
happy_data$HIQ <- factor(happy_data$quartile,labels = c("1", "2", "3", "4"))
# 2017 data after cleaning
data.2017.cleaned <- happy_data_all[which(happy_data_all$year=="2017"),]
```
## II. Exploratory Data Analysis
### II.A Initial Summary Statistics
#### II.A.1 Count of Countries per Year
Before cleaning the data, the number of countries listed in this report varies year by year from 2005 to 2019: 2005 has the fewest participating countries, with 27, and 2017 the most, with 147. After removing variables that are less than 70% populated and restricting to the years 2014-2019, 117 countries are left for analysis in each year.
```{r, echo=FALSE}
country_count <- data.frame("Year" = c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014", "2015","2016","2017","2018", "2019"), "Count_of_Countries" = c(country_count.2005, country_count.2006, country_count.2007, country_count.2008, country_count.2009, country_count.2010, country_count.2011, country_count.2012, country_count.2013, country_count.2014, country_count.2015, country_count.2016, country_count.2017, country_count.2018, country_count.2019))
country_count.table <- kable(country_count, format = "simple", caption = "Count of Countries by Year")
country_count.table
```
#### II.A.2 Distribution of Happiness Values
One reason we are able to move forward with our method of cleaning the data (explained in II.B) is that the remaining 117 countries are a representative sample of the population. This can be seen in the Happiness Index distributions in the histograms below, before and after cleaning.
```{r}
par(mfrow=c(2,1))
hist.2017.preclean <- hist(data.2017.preclean$Life.Ladder, main = "Histogram of Country Happiness in 2017: Before Cleaning", xlab = "Happiness Index", col = "darkmagenta")
hist.2017.cleaned <- hist(data.2017.cleaned$Life.Ladder, main = "Histogram of Country Happiness in 2017: After Cleaning", xlab = "Happiness Index", col = "blue")
```
#### II.A.3 Average Happiness Over Time
```{r, echo=FALSE}
hi.2014 <- happy_index_all[which(happy_index_all$year=="2014"),]
hi.2015 <- happy_index_all[which(happy_index_all$year=="2015"),]
hi.2016 <- happy_index_all[which(happy_index_all$year=="2016"),]
hi.2017 <- happy_index_all[which(happy_index_all$year=="2017"),]
hi.2018 <- happy_index_all[which(happy_index_all$year=="2018"),]
hi.2019 <- happy_index_all[which(happy_index_all$year=="2019"),]
# HI series (column 2) for a single named country
country_hi <- function(nm) happy_index_all[which(happy_index_all$Country == nm), 2]
avg_HI <- data.frame(
  "Year" = 2014:2019,
  "Average" = c(mean(hi.2014$HI, na.rm = TRUE), mean(hi.2015$HI, na.rm = TRUE),
                mean(hi.2016$HI, na.rm = TRUE), mean(hi.2017$HI, na.rm = TRUE),
                mean(hi.2018$HI, na.rm = TRUE), mean(hi.2019$HI, na.rm = TRUE)),
  "Afghanistan" = country_hi("Afghanistan"),
  "Brazil" = country_hi("Brazil"),
  "China" = country_hi("China"),
  "Germany" = country_hi("Germany"),
  "Jamaica" = country_hi("Jamaica"),
  "Libya" = country_hi("Libya"),
  "New_Zealand" = country_hi("New Zealand"),
  "Philippines" = country_hi("Philippines"),
  "Somalia" = country_hi("Somalia"),
  "United_States" = country_hi("United States")
)
HI_over_time <- plot_ly(avg_HI, x = ~Year, y = ~Average, name = 'Average',
                        type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Afghanistan, name = 'Afghanistan', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Brazil, name = 'Brazil', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~China, name = 'China', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Germany, name = 'Germany', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Jamaica, name = 'Jamaica', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Libya, name = 'Libya', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~New_Zealand, name = 'New Zealand', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Philippines, name = 'Philippines', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~Somalia, name = 'Somalia', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~United_States, name = 'United States', type = 'scatter', mode = 'lines')
x_axis <- seq(2014, 2019, by = 1)
HI_over_time <- HI_over_time %>%
layout(
title = "Happiness Index over Time",
xaxis = list(title="Year"),
yaxis = list(title="Happiness Index")
)
HI_over_time
```
### II.B Justify Data Processing Decisions
We collected all six years of Happiness Index scores from the 2015 to 2020 reports and assigned each outcome value to the year of the underlying data from which it was determined in the original dataset. For processing and cleaning the data, we first read in and combined all sheets from the workbook. Next, we kept only the independent variables that were at least 70% populated and removed the rest. Finally, we removed rows where the Happiness Index (HI) was missing and dropped any observations that did not have all variables populated, since we did not want to impute missing values for the decision tree models. After cleaning and processing, we were left with 615 of the original 1,026 observations.
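The two filters described above can be sketched on a toy data frame; this is an illustrative example only, with made-up columns, applying the same logic that the processing chunk applies to the real data.

```r
# Sketch of the filtering steps: keep columns >= 70% populated,
# then drop rows with any remaining missing values (no imputation).
df <- data.frame(
  a = c(1, 2, NA, 4, 5),   # 80% populated -> kept
  b = c(NA, NA, NA, 4, 5), # 40% populated -> dropped
  c = c(1, 2, 3, 4, 5)     # fully populated -> kept
)

# fraction of non-missing values in each column
pct_populated <- colMeans(!is.na(df))
df <- df[, pct_populated >= 0.70]

# keep only complete rows
df <- df[complete.cases(df), ]
```

After both steps the toy frame retains columns `a` and `c` and four of the five rows.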
#### II.B.1 Data Characteristics
The data is a sample, as not all of the world's countries are present in the data set. A poll was implemented to gather data on the variables of interest; this data from the Gallup World Poll was used to determine the influence of each factor in calculating the happiness score and rank for each country. While the data set does not mention how the specific countries were selected, we can see that there are observations for all of the larger and more prominent countries of the world. The countries that tend to be missing are the smaller ones, where polling may simply not have been conducted or was not deemed suitable. Looking at the data set, the qualitative variables are the name and region of each country; all of the other variables are quantitative.
#### II.B.2 Data Issues
There are a couple of issues with the data set. The first is that it only has 157 countries, while there are 195 countries recognized by the United Nations. This would affect the statistical calculations and graphics: statistics such as the mean and median would most likely change if all countries were represented, and the graphs would change with more countries present. The countries with the lowest happiness scores could change, the boxplots representing each region could change, and trends would be seen more thoroughly if all of the countries were present. Another potential issue is that polls were conducted to determine the values for each of the factors with respect to the happiness score. We don’t know how these surveys were conducted in each country, whether respondents took them seriously, whether polling was consistent across countries, or whether the answers are entirely representative of each country’s population. Any of these could lead to inaccurate representation in the data.
#### II.B.3 Correlation Matrix
We've created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, Log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix tells us how each variable is correlated with every other quantitative factor and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder. Similarly, perceptions of corruption are negatively correlated with the happiness index, as expected.
```{r}
#Correlation Matrix
happy <- happy_data
happy.numeric <- happy[,sapply(happy,is.numeric)]
matrix <- cor(happy.numeric, method="pearson")
knitr::kable(round(matrix,2))
```
## III. Clustering Analysis
In analyzing our data, we have defined the variable of interest to be the quartile in which the happiness index falls for a given country and year. Accordingly, the base rate would be 25% of our observations for each quartile, meaning the probability that we assign a country-year observation to the correct quartile by random chance is 0.25, by construction. In this section, we instead try to cluster the data using our explanatory variables to group together countries with similar characteristics. Using both the elbow method and the results of the NbClust function, we’ll determine the optimal number of clusters into which the data will be grouped. Once countries with similar characteristics have been sorted into this optimal number of groups, we will examine the happiness index scores associated with each grouping, with the expectation that countries within a grouping would likely have similar happiness index scores.
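As a minimal sketch of the quartile construction behind this 25% base rate: the report uses `gtools::quantcut`, but base R's `cut()` on the sample quantiles gives an equivalent split, shown here on simulated scores rather than the real HI values.

```r
# Simulated stand-in happiness scores
set.seed(1)
hi <- runif(1000, min = 2, max = 8)

# Bin scores at the empirical quartiles, mirroring quantcut(hi, q = 4)
quartile <- cut(hi,
                breaks = quantile(hi, probs = 0:4 / 4),
                include.lowest = TRUE,
                labels = c("1", "2", "3", "4"))

# By construction each bin holds ~25% of the observations
prop.table(table(quartile))
```

Because the bins are defined by the sample's own quartiles, random guessing assigns an observation to the correct bin with probability 0.25.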
### III.A. Clustering k-means
In order to apply the k-means algorithm, the data required a few modifications. First, we removed the variable of interest from the data (both the factorized quartile and raw happiness index measure). The goal of this exercise is to group together countries with similar characteristics under the hypothesis that these similarities would imply similar measures of happiness. Accordingly, Country was also removed as we don't want that factor variable to provide explanatory power (lest our advice for an unhappy country be to try not being that country). The other factor variable in our dataset, the year of the observation, does seem relevant as global events in a particular year can certainly explain shifts in a country’s happiness, e.g. a global pandemic. Rather than utilizing dummy coding as we would for nominal factor variables, we instead allowed year to be treated as a numeric variable and applied the same standardization as the rest of the variables. Having created a final data set consisting only of numeric variables, we then standardized the entire dataset.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
happy_data_num <- happy_data[,sapply(happy_data, is.numeric)]
happy_data_num <- happy_data_num %>%
select(-HI, -year)
happy_data_num <- as.data.frame(scale(happy_data_num))
```
#### III.A.1 Optimal Number of Clusters{.tabset}
Applying the k-means function to assign observations to as few as one or up to ten different clusters, we seek to identify the number of clusters that will maximize the inter-cluster variance (i.e., the sum of the distances between points from different clusters) subject to the constraint of minimizing the intra-cluster variance (the sum of the distances between points within the same cluster).
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
explained_variance = function(data_in, k){
set.seed(1)
kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 50)
var_exp = kmeans_obj$betweenss / kmeans_obj$totss
}
explained_var_happy = sapply(1:10, explained_variance, data_in = happy_data_num)
elbow_data_happy = data.frame(k = 1:10, explained_var_happy)
```
##### III.A.1.1 Elbow Method
The elbow graph shown below plots the number of clusters used against the fraction of the total variance accounted for by the designated number of clusters. The latter refers to the ratio of inter-cluster variance relative to the total variance in the data (i.e., the sum of the distances between all the points in the data set). The ‘elbow’, the point beyond which additional clusters yield only small gains in inter-cluster variance relative to the reduction in intra-cluster variance, appears to fall at around 3 clusters.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(elbow_data_happy,
aes(x = k,
y = explained_var_happy)) +
geom_point(size = 4) +
geom_line(size = 1) +
xlab('Number of Clusters') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
```
##### III.A.1.2 NbClust Majority Rule
Using the NbClust function we find that the optimal number of clusters is 3; as shown on the graph below, the majority of models (12) voted for 3 clusters. This reaffirms our conclusion drawn from our initial inspection of the elbow chart.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
nbclust_happy = NbClust(data = happy_data_num, method = "kmeans")
freq_k_happy = nbclust_happy$Best.nc[1,]
freq_k_happy = data.frame(freq_k_happy)
```
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(freq_k_happy,
aes(x = freq_k_happy)) +
geom_bar() +
scale_x_continuous(breaks = seq(0, max(freq_k_happy), by = 1)) +
scale_y_continuous(breaks = seq(0, 12, by = 1)) +
labs(x = "Number of Clusters",
y = "Number of Votes",
title = "Cluster Analysis")
```
#### III.A.2 Assigning Optimal Number of Clusters{.tabset}
Using the recommended number of clusters, we find that 3 clusters explain 44% of the total variance. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model does extremely well at assigning countries to clusters reflecting overall happiness.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
kmeans_obj_happy = kmeans(happy_data_num, centers = 3, algorithm = "Lloyd", iter.max = 50)
(var_exp_k_happy = kmeans_obj_happy$betweenss / kmeans_obj_happy$totss)
Final_Happy_Clusters <- happy_data
Final_Happy_Clusters$Cluster <- as.factor(kmeans_obj_happy$cluster)
```
As the graphs below illustrate, our clusters do quite well at identifying countries on the lower and higher end of the happiness index range. However, as mentioned earlier, the self-reported happiness variable (Life Ladder) appears to be too tightly correlated with our dependent variable. In the following subsection we will explore what the results of our clustering analysis would be without this explanatory variable.
##### Life Ladder
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Life.Ladder,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Life Ladder and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Life Ladder") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Log.GDP.per.capita,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Log GDP per Capita and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### Life Expectancy
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Healthy.life.expectancy.at.birth ,
color = Cluster,
shape = HIQ)) +
geom_point(aes(color = Cluster), size = 2) +
ggtitle("Life Expectancy and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Healthy Life Expectancy at Birth") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
### III.B. Revised Clustering Analysis
As discussed in the section above, the self-reported happiness variable (Life Ladder) is very highly correlated with the happiness index, our dependent variable. In this section we remove the variable from our data and redo our clustering analysis. The goal of this analysis is to identify countries with happiness index scores that diverge from what our expectations would be based upon all the other factors (the growth in GDP, trust in government, life expectancy, etc.).
#### III.B.1 Removing Life Ladder{.tabset}
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
happy_data_num_b <- happy_data_num %>% select(-Life.Ladder)
explained_var_happy_b = sapply(1:10, explained_variance, data_in = happy_data_num_b)
elbow_data_happy_b = data.frame(k = 1:10, explained_var_happy_b)
```
As before, we now re-apply the k-means function to assign observations to as few as one or up to ten different clusters, seeking to identify the number of clusters that will maximize the inter-cluster variance and minimize the intra-cluster variance.
##### III.B.1.1 Elbow Method
The elbow graph shown below plots the number of clusters used against the fraction of the total variance accounted for by the designated number of clusters. The ‘elbow’ still appears to fall at around 3 clusters.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(elbow_data_happy_b,
aes(x = k,
y = explained_var_happy_b)) +
geom_point(size = 4) +
geom_line(size = 1) +
xlab('Number of Clusters') +
ylab('Inter-cluster Variance / Total Variance') +
theme_light()
```
##### III.B.1.2 NbClust Majority Rule
Using the NbClust function we find that the optimal number of clusters is 3; as shown on the graph below, the majority of models (12) voted for 3 clusters. This reaffirms our conclusion drawn from our initial inspection of the elbow chart and is consistent with our findings prior to removing Life Ladder from our data.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
nbclust_happy_b = NbClust(data = happy_data_num_b, method = "kmeans")
freq_k_happy_b = nbclust_happy_b$Best.nc[1,]
freq_k_happy_b = data.frame(freq_k_happy_b)
```
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(freq_k_happy_b,
aes(x = freq_k_happy_b)) +
geom_bar() +
scale_x_continuous(breaks = seq(0, max(freq_k_happy_b), by = 1)) +
scale_y_continuous(breaks = seq(0, 12, by = 1)) +
labs(x = "Number of Clusters",
y = "Number of Votes",
title = "Cluster Analysis")
```
#### III.B.2 Assigning Optimal Number of Clusters{.tabset}
Using the recommended number of clusters on the revised data set, we find that 3 clusters explain 42.6% of the variance, only slightly less than the 44% accounted for when applying the same number of clusters with the variable Life Ladder included. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model still does quite well at assigning countries to clusters reflecting overall happiness.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
set.seed(1)
kmeans_obj_happy_b = kmeans(happy_data_num_b, centers = 3, algorithm = "Lloyd", iter.max = 50)
(var_exp_k_happy_b = kmeans_obj_happy_b$betweenss / kmeans_obj_happy_b$totss)
Final_Happy_Clusters$Cluster_b <- as.factor(kmeans_obj_happy_b$cluster)
```
As before, we now plot our revised clusters against the observed happiness index and variables of interest. Note that even without training the clusters using the self-reported variable Life Ladder, our clusters still do well at identifying countries on the high and low end of the happiness index spectrum.
##### Life Ladder
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Life.Ladder,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Life Ladder and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Life Ladder") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Log.GDP.per.capita,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Log GDP per Capita and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
##### Life Expectancy
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = HI,
y = Healthy.life.expectancy.at.birth ,
color = Cluster_b,
shape = HIQ)) +
geom_point(aes(color = Cluster_b), size = 2) +
ggtitle("Life Expectancy and Happiness Index (HI) Clusters") +
xlab("Happiness Index (HI)") +
ylab("Healthy Life Expectancy at Birth") +
scale_shape_manual(name = "HI Quartile",
labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"),
values = c("1", "2","3", "4")) +
scale_color_manual(name = "Cluster",
labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"),
values = c("blue", "grey", "red")) +
theme_light()
```
### III.C Evaluating Clusters
Having chosen the optimal number of clusters and set of explanatory variables, we can assess the results of the k-means clustering analysis using two approaches. First, we will compare the distribution of our clusters against the initially designated quartiles of the happiness index distribution. Then we will examine a series of visualizations of the data to glean insights into our clusters.
```{r, include=FALSE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
Final_Happy_Clusters$Cl_b <- factor(Final_Happy_Clusters$Cluster_b, labels = c("Happy Cluster", "Average Cluster", "Unhappy Cluster"))
comp_clust_HI = table(as_factor(Final_Happy_Clusters$Cl_b), Final_Happy_Clusters$HIQ)
```
#### III.C.1 Pseudo-Confusion Matrix
The following table compares the actual happiness quartile assignments and the cluster assigned by our model. Note that no countries falling in the lowest quartile of the HI were assigned to the cluster associated with happier countries, and vice versa. The majority of observations in the happier and unhappier clusters are concentrated in the 1st and 4th quartiles, respectively. This is a good indication that there are strong similarities in the characteristics of the happiest and least happy countries. Note that the cluster spanning all four quartiles of the happiness index does not necessarily indicate a good or bad fit; rather, it reflects that when quartiles are cut from a continuous distribution, there may be little difference between countries whose happiness indices fall on either side of a quartile threshold.
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
kable(comp_clust_HI)
```
#### III.C.2 Visualizing Final Clusters{.tabset}
Having seen that our clusters fit well when plotted against the happiness index itself, we can begin to explore how particular variables influenced a given cluster assignment by plotting explanatory variables and contrasting the assigned cluster and observed happiness index. The graphs below plot pairs of explanatory variables against our predicted happiness clustering (the shape) and the actual assigned happiness index (color scale).
##### Social Support
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Healthy.life.expectancy.at.birth,
y = Social.support,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Life Expectancy & Social Support by Happiness Index (HI) and Cluster") +
xlab("Healthy Life Expectancy at Birth") +
ylab("Social Support") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### GDP-per-Capita
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Healthy.life.expectancy.at.birth,
y = Log.GDP.per.capita,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Life Expectancy & GDP by Happiness Index (HI) and Cluster") +
xlab("Healthy Life Expectancy at Birth") +
ylab("Log GDP per Capita") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Democratic Quality
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Delivery.Quality,
y = Democratic.Quality,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Delivery & Democratic Quality by Happiness Index (HI) and Cluster") +
xlab("Delivery Quality") +
ylab("Democratic Quality") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Perceived Corruption
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Perceptions.of.corruption,
y = Confidence.in.national.government,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Corruption and Govt Confidence by Happiness Index (HI) and Cluster") +
xlab("Perceptions of Corruption") +
ylab("Confidence in National Government") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
##### Generosity
```{r, include=TRUE, warning=FALSE, message=FALSE, error=FALSE, echo=FALSE, eval=TRUE}
ggplot(Final_Happy_Clusters,
aes(x = Generosity,
y = Freedom.to.make.life.choices,
shape = Cluster_b)) +
geom_point(aes(color = HI), size = 2) +
ggtitle("Generosity and Freedom by Happiness Index (HI) and Cluster") +
xlab("Generosity") +
ylab("Freedom to Make Life Choices") +
scale_shape_manual(name = "Cluster",
labels = c("Happy", "Average", "Unhappy"),
values = c(2, 20, 6)) +
scale_color_viridis_c() +
theme_light()
```
### III.D Conclusion of k-means
In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.
## Decision Tree
```{r, include=FALSE}
library(rio)
library(plyr)
library(dplyr)
library(tidyverse)
library(rpart)
library(psych)
library(pROC)
#install.packages("rpart.plot")
library(rpart.plot)
#install.packages("rattle")
library(rattle)
library(caret)
library(knitr)
library(tibble)
```
The purpose of this decision tree is to classify each country into happiness quartiles based on variables such as life ladder, social support, and democratic quality. The decision tree model will first be built using default settings, and then the threshold will be adjusted to optimize for both the highest and lowest quartiles, allowing us to glean insights into which factors contribute the most to countries' happiness.
### Methods
The Happiness Index was changed into quartiles, where the quartile distribution is as follows:
```{r, echo=FALSE, eval=TRUE}
quantile(happy_data$HI)
happy_data1 <- happy_data %>% select(HIQ,year, Life.Ladder, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
```
Any rows with NA were removed from the data frame in order to perform the decision tree analysis.
##### Base rate calculation
The base rate for this classifier is the individual percentages for each quartile. Quartile 1 has a base rate of 25.04%, Quartile 2: 25.04%, Quartile 3: 24.9%, and Quartile 4: 25.04%. This base rate is as expected when distributing data into quartiles.
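These base rates can be reproduced with a one-line frequency table; a minimal sketch on an illustrative quartile vector (`hiq` here stands in for `happy_data$HIQ`):

```r
# Base rate per class = class frequency / total observations
hiq <- factor(c(1, 1, 2, 2, 3, 4, 4, 4))  # illustrative quartile labels
base_rates <- prop.table(table(hiq))      # proportions summing to 1
round(base_rates, 3)
```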
``` {r, include=FALSE}
#6 For the multi-class this will be the individual percentages for each class.
library(dplyr)
dplyr::summarize
detach("package:plyr", unload=TRUE)
table(happy_data$HIQ)
HIQ_BR <- happy_data %>%
group_by(HIQ) %>%
summarize(hiq_cnt=n())
HIQ_BR$hiq_br <- (HIQ_BR$hiq_cnt / length(happy_data$HIQ))
(h_br <- (HIQ_BR$hiq_cnt / length(happy_data$HIQ)))
HIQ_BR
```
#### Build model using default settings
``` {r, include=FALSE}
#7 Build your model using the default settings
happy_data1 <- as.data.frame(happy_data1)
set.seed(1980)
happy_data1_gini_t = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data1)
```
The most important variable for the tree is Life.Ladder. The first split in the tree is created using this variable, as seen below, with one branch of life ladder less than 5.5 and the other of life ladder greater than or equal to 5.5. Life ladder asks people to rate their own lives on a 0 to 10 scale, with 10 being the best possible life. Thus, it seems that the most important variable for a country's happiness is how its people rate their lives, or their perception of how good their life is.
Life ladder is the only variable used in this classifier, and as seen below, its split points line up almost perfectly with the quartile distribution above, the only discrepancy being a life ladder cutoff of 5.5 as opposed to a quartile break of 5.3.
```{r, include=FALSE}
#8 View the results, what is the most important variable for the tree?
happy_data1_gini_t
#View(happy_data_gini_t$frame)
happy_data1_gini_t$variable.importance
# Life.Ladder is most important
```
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
#9 Plot the tree using the rpart.plot package
rpart.plot(happy_data1_gini_t, type =4, extra = 101)
```
##### Optimal number of splits
The rel error column is the relative error of the tree's predictions on the training data; xerror is the cross-validated error; and xstd is the standard deviation of the cross-validated errors. These lead to the selection rule: choose the split at the lowest level where rel error + xstd < xerror.
The graph below plots the cross-validated relative error (xerror) on the y-axis against complexity on the x-axis. From our calculations, xerror exceeds opt after the 3rd split, where xerror is 0.2234 and opt is 0.21314. In the graph, the threshold appears to be crossed after the third split, which would indicate an optimal tree at the fourth level. Since the plot and the table comparing opt and xerror do not agree exactly, we take 4 splits as the optimal amount because it lines up with the quartile designation. The optimal cp, i.e. the cp at four splits, is 0.01, as seen in the table below.
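The selection rule can be applied mechanically to the cp table; a self-contained sketch on an illustrative table (values invented for demonstration, mirroring rpart's cptable columns):

```r
# Toy cp table with rpart's column names (values are illustrative)
cptable <- data.frame(
  CP          = c(0.50, 0.20, 0.05, 0.01),
  nsplit      = c(0, 1, 2, 3),
  `rel error` = c(1.00, 0.50, 0.30, 0.20),
  xerror      = c(1.02, 0.52, 0.35, 0.30),
  xstd        = c(0.05, 0.04, 0.03, 0.02),
  check.names = FALSE
)
cptable$opt <- cptable$`rel error` + cptable$xstd           # rel error + xstd bound
first_crossing <- min(which(cptable$xerror > cptable$opt))  # first row where xerror exceeds opt
```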
```{r, echo=FALSE, warning=FALSE, message=FALSE}
#10 plot and convert the cp table to a data.frame
#HIQ Version
plotcp(happy_data1_gini_t)
```
```{r, echo=FALSE, eval=TRUE}
#11 Add together the real error and standard error to create a new column and determine the optimal number of splits.
cptable_t <- as.data.frame(happy_data1_gini_t$cptable)
cptable_t$opt <- cptable_t$`rel error`+ cptable_t$xstd
kable(cptable_t)
#View(cptable_t)
# At 3 splits xerror (0.2234) exceeds opt (0.21314)
```
#### Model Evaluation
```{r, include=FALSE}
#12 Use the predict function and your model to predict the target variable.
happy_data1_fitted_t = predict(happy_data1_gini_t, type= "class")
#View(as.data.frame(happy_data_fitted_t))
```
##### Confusion Matrix
```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, error=FALSE}
# creating confusion matrix to compare actual and predicted values
hiq_conf_matrix = table(happy_data1_fitted_t, happy_data1$HIQ)
hiq_conf_matrix
```
The confusion matrix above compares the predicted quartile values (Q1-4) generated by the model to the actual quartiles. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 85.5% of observations. Our model only ever misclassifies by one quartile up or down (e.g., quartile 2 may be misclassified as 1 or 3, but never as 4).
##### Hit and Detection Rate
The error rate is defined as the number of quartile misclassifications (e.g. predicting Q1 when the true class is Q4) divided by the total number of data points. The error rate for this model is 14.5%, which is fairly low. The hit rate is the proportion of predictions that were correctly identified, or 1 - error rate = 85.5%.
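Both rates fall directly out of the diagonal of the confusion matrix; a self-contained sketch on an illustrative 2x2 matrix:

```r
# Toy confusion matrix: rows = predicted class, columns = actual class
cm <- matrix(c(40,  5,
               10, 45), nrow = 2, byrow = TRUE)
hit_rate   <- sum(diag(cm)) / sum(cm)  # correctly classified share
error_rate <- 1 - hit_rate             # misclassified share
```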
```{r, echo=FALSE, include=FALSE}
# Diagonal of matrix (correctly identified observations)
sum(hiq_conf_matrix[row(hiq_conf_matrix)== col(hiq_conf_matrix)])
hiq_accur_rate = sum(hiq_conf_matrix[row(hiq_conf_matrix)== col(hiq_conf_matrix)]) / sum(hiq_conf_matrix)
paste0("Correctly Identifying:", hiq_accur_rate * 100, "%")
hiq_error_rate = sum(hiq_conf_matrix[row(hiq_conf_matrix)!= col(hiq_conf_matrix)]) / sum(hiq_conf_matrix)
paste0("Error Rate:", hiq_error_rate * 100, "%")
```
##### Comparison of Results to Base Rates
```{r, eval=TRUE, echo=FALSE, include=FALSE}
#comparing to four base rates
HIQ_Chk <- happy_data1 %>%
select(HIQ)
HIQ_Chk <- as.data.frame(HIQ_Chk)
HIQ_Chk$Predicted <- happy_data1_fitted_t
HIQ_Chk <- HIQ_Chk %>%
mutate(correct = if_else(HIQ==Predicted, 1, 0),
guessedQ1 = if_else(HIQ!=Predicted & Predicted=="1", 1, 0),
guessedQ2 = if_else(HIQ!=Predicted & Predicted=="2", 1, 0),
guessedQ3 = if_else(HIQ!=Predicted & Predicted=="3", 1, 0),
guessedQ4 = if_else(HIQ!=Predicted & Predicted=="4", 1, 0)) %>%
group_by(HIQ) %>%
summarise(correct=sum(correct), guessedQ1=sum(guessedQ1),
guessedQ2=sum(guessedQ2), guessedQ3=sum(guessedQ3),
guessedQ4=sum(guessedQ4), total=n())
HIQ_Chk <- merge(HIQ_Chk, HIQ_BR)
#View(HIQ_Chk)
# Quick Output:
HIQ_Chk$accuracy <- HIQ_Chk$correct/HIQ_Chk$total
HIQ_Chk_ <- HIQ_Chk %>% select(HIQ, accuracy, hiq_br)
#View(HIQ_Chk_)
kable(HIQ_Chk_)
```
The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted quartile values (Q1-4) generated by the model to the actual happiness quartile. From the confusion matrix it appears that we are correctly identifying 85.5% of observations, with errors coming from the next highest or lowest quartile.
1. For Q1, we've correctly identified 83.11% of observations which is considerably better than our base rate of 25%. Our model misidentified 17 Q2 observations as Q1.
2. For Q2, our model is performing quite well, correctly identifying 83.11% of observations, considerably better than our base rate of 25%. Our model misidentified 26 Q1 observations and 16 Q3 observations as Q2.
3. For Q3, we've correctly identified 83.0% of observations considerably better than our base rate of 25%. The model misidentified 9 Q2 observations and 11 Q4 observations as Q3.
4. For Q4, we've correctly identified 92.9% of observations, considerably better than our base rate of 25%. Our model misidentified 10 Q3 observations as Q4.
##### ROC and AUC Score
The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with an overall relatively high Multi-class Area Under the Curve (AUC): .9527.
There are a few conclusions to glean from this. First, this decision tree model has merit in being significantly better than guessing for some classes; while the model might not be perfect, it is a large step up from using nothing.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, include=FALSE }
#16 Generate a ROC and AUC output, interpret the results
h_roc <- multiclass.roc(happy_data1$HIQ, as.numeric(happy_data1_fitted_t), plot = TRUE)
h_roc$auc
```
##### Metric to optimize
If our primary goal is to identify the highest quartile of happiness, Q4, we would lower the probability threshold for assigning observations to Q4 while trying to preserve our high degree of accuracy for the other quartiles. The results of lowering the probability threshold for Q4 to 0.07 are shown in the confusion matrix below. Note that lowering this threshold means we are now correctly identifying all 154 of the Q4 observations (as intended). However, we are no longer identifying any Q3 observations: at this threshold, all 137 Q3 observations are classified as Q4. Because the threshold must be this low to correctly classify every actual Q4 observation, yet doing so sends every Q3 observation to Q4, the model evidently has a hard time distinguishing between Q3 and Q4.
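Lowering the threshold amounts to overriding the usual highest-probability rule whenever the Q4 probability clears the cutoff. A minimal sketch on an illustrative probability matrix (the `probs` rows are invented; 0.07 is the cutoff from the text):

```r
# Each row holds predicted class probabilities for quartiles 1-4
probs <- data.frame(`1` = c(0.70, 0.40, 0.05),
                    `2` = c(0.20, 0.30, 0.05),
                    `3` = c(0.05, 0.25, 0.10),
                    `4` = c(0.05, 0.05, 0.80),
                    check.names = FALSE)
threshold <- 0.07
# Assign Q4 whenever its probability clears the cutoff; otherwise take the
# highest-probability class among Q1-Q3
pred <- ifelse(probs$`4` >= threshold, "4",
               colnames(probs)[max.col(probs[, 1:3], ties.method = "first")])
```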
```{r, include=FALSE}
#17 Use the predict function to generate percentages, then select several different threshold levels using the confusion matrix function and interpret the results? What metric should we be trying to optimize.
#t_roc1 <- multiclass.roc(happy_data$HIQ), ifelse(happy_data_fitprob_t[,'T1'] >= 0.35, 0, 1), plot = TRUE)
set.seed(1980)
happy_data1_fitprob_t = predict(happy_data1_gini_t, type= "prob")
happy_data1_fitprob_t <- as.data.frame(happy_data1_fitprob_t)
happy_data1_fitprob_t = happy_data1_fitprob_t %>%
mutate(outcome = case_when(`4`>=.07 ~ "4",
`4`<0.3 & `2`>=`1` & `2`>=`3` ~"2",
`4`<0.3 & `3`>=`2` & `3`>=`1` ~"3",
`4`<0.3 & `1`>=`2` & `1`>=`3` ~"1"))
happy_data1_fitprob_t$outcome <- as.factor(happy_data1_fitprob_t$outcome)
table(happy_data1_fitprob_t$outcome, happy_data1$HIQ)
```
```{r, eval=TRUE, echo=FALSE, include=TRUE}
table(happy_data1_fitprob_t$outcome, happy_data1$HIQ)
```
#### Hyperparameter adjustment
Adjusting the complexity parameter (cp) threshold to 0.01 yields a model identical to the first: misclassifications still fall only one quartile above or below, and our accuracy remains the same for each class and thus overall. The optimal cp from earlier was 0.01, and rerunning the decision tree with it yielded identical results, likely because the initial decision tree already had three splits and 0.01 is rpart's default cp.
```{r, echo=FALSE, include=FALSE}
set.seed(1980)
happy_data1_gini_t2 = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data1, control = rpart.control(cp=.01))
#19 Try adjusting several other hyperparameters via rpart.control and review the model evaluation #Change CP Threshold to to much lower cutoff at 0.0625
#<- includes depth zero, the control for additional options (could use CP, 0.01 is the default)
plotcp(happy_data1_gini_t2)
#View(bc_tree_gini_t2$frame)
rpart.plot(happy_data1_gini_t2, type =4, extra = 101)
cptable_t2 <- as.data.frame(happy_data1_gini_t2$cptable)
cptable_t2$opt <- cptable_t2$`rel error`+ cptable_t2$xstd
# The quality of the model is impacted by:
#View(cptable_t)
#View(cptable_t2)
set.seed(1980)
happy_data1_fitted_t2 = predict(happy_data1_gini_t2, type= "class")
table(happy_data1_fitted_t2, happy_data1$HIQ)
table(happy_data1_fitted_t, happy_data1$HIQ)
```
#### Decision Tree Model, no life ladder
Next, we will investigate what the decision tree looks like without the life ladder variable, which functioned as an almost perfect classifier. In order to prevent overfitting, we set minsplit to 93, where minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted. Thus, roughly 15% of the data must fall in a node before a split is attempted.
The most important variable of this tree is healthy life expectancy at birth, and the next most important is log GDP per capita. The decision tree can be seen below and is evidently significantly more complicated than the first tree, which was based on the single variable life ladder.
```{r, include=FALSE}
# selecting for all variables but Life.Ladder
happy_data2 <- happy_data %>% select(HIQ,year,Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption, Positive.affect, Negative.affect, Confidence.in.national.government, Democratic.Quality, Delivery.Quality)
# running tree with default settings
happy_data2 <- as.data.frame(happy_data2)
set.seed(1980)
happy_data2_gini_t = rpart(HIQ~.,
method = "class", parms = list(split = "gini"),
data = happy_data2, minsplit=93)
```
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
#9 Plot the tree using the rpart.plot package
rpart.plot(happy_data2_gini_t, type =4, extra = 101)
```
#### Variable Importance
Delving more into variable importance, the table below shows the variable importance value for each variable. Note that healthy life expectancy at birth is the most important variable while generosity is the least important variable in predicting the quartile of happiness a country is in.
```{r, include=TRUE, echo=FALSE, eval=TRUE}
#happy_data2_gini_t
#View(happy_data_gini_t$frame)
kable(happy_data2_gini_t$variable.importance)
```
The graph below depicts this variable importance visually. It is interesting to note that healthy life expectancy at birth and log GDP per capita rank significantly higher than the other variables, with importance values at least 1.6x higher than the next most important variable, delivery quality. Thus, if a country wanted to increase its happiness ranking without taking into account people's perception of their life quality (life ladder), it could focus on public health measures that raise healthy life expectancy and on growing its GDP per capita.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, error=FALSE}
df <- data.frame(`importance` = happy_data2_gini_t$variable.importance)
df2 <- df %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(`importance`) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(df2) +
geom_col(aes(x = variable, y = `importance`),
col = "black", show.legend = F) +
coord_flip() +
scale_fill_grey() +
theme_bw()
```
#### Model Evaluation
Let's evaluate this model in comparison with the first decision tree model we generated.
```{r, include=FALSE}
#12 Use the predict function and your model to predict the target variable.
happy_data2_fitted_t = predict(happy_data2_gini_t, type= "class")
#View(as.data.frame(happy_data_fitted_t))
```
##### Confusion Matrix
```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, error=FALSE}
# creating confusion matrix to compare actual and predicted values
hiq2_conf_matrix = table(happy_data2_fitted_t, happy_data2$HIQ)
hiq2_conf_matrix
```
The confusion matrix above compares the predicted quartile values (Q1-4) generated by the model to the actual quartiles. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 72.2% of observations, significantly lower than the previous model's 85.5%. Unlike the first model, which only misclassified by one quartile up or down, this model's errors span all quartiles, as seen in the predicted Q3 row.
##### Hit and Detection Rate
The error rate is defined as the number of quartile misclassifications (e.g. predicting Q1 when the true class is Q4) divided by the total number of data points. The error rate for this model is 27.8%, which is fairly high. The hit rate is the proportion of predictions that were correctly identified, or 1 - error rate = 72.2%.
##### Comparison of Results to Base Rates
```{r, eval=TRUE, echo=FALSE, include=FALSE}
#comparing to four base rates
HIQ_Chk2 <- happy_data2 %>%
select(HIQ)
HIQ_Chk2 <- as.data.frame(HIQ_Chk2)
HIQ_Chk2$Predicted <- happy_data2_fitted_t
HIQ_Chk2 <- HIQ_Chk2 %>%
mutate(correct = if_else(HIQ==Predicted, 1, 0),
guessedQ1 = if_else(HIQ!=Predicted & Predicted=="1", 1, 0),
guessedQ2 = if_else(HIQ!=Predicted & Predicted=="2", 1, 0),
guessedQ3 = if_else(HIQ!=Predicted & Predicted=="3", 1, 0),
guessedQ4 = if_else(HIQ!=Predicted & Predicted=="4", 1, 0)) %>%
group_by(HIQ) %>%
summarise(correct=sum(correct), guessedQ1=sum(guessedQ1),
guessedQ2=sum(guessedQ2), guessedQ3=sum(guessedQ3),
guessedQ4=sum(guessedQ4), total=n())
HIQ_Chk2 <- merge(HIQ_Chk2, HIQ_BR)
#View(HIQ_Chk2)
# Quick Output:
HIQ_Chk2$accuracy <- HIQ_Chk2$correct/HIQ_Chk2$total
HIQ_Chk3 <- HIQ_Chk2 %>% select(HIQ, accuracy, hiq_br)
#View(HIQ_Chk3)
kable(HIQ_Chk3)
```
The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted quartile values (Q1-4) generated by the model to the actual happiness quartile. From the confusion matrix it appears that we are correctly identifying 72.2% of observations.
1. For Q1, we've correctly identified 87.3% of observations, considerably better than our base rate of 25% and slightly higher than the first model's Q1 accuracy of 83.11%.
2. For Q2, our model correctly identifies 46.4% of observations, better than our base rate of 25% but well below the first model's Q2 accuracy of 83.11%.
3. For Q3, we've correctly identified 51.8% of observations, better than our base rate of 25% but well below the first model's Q3 accuracy of 83.0%.
4. For Q4, we've correctly identified 96.2% of observations, considerably better than our base rate of 25% and higher than the first model's Q4 accuracy of 92.9%.
##### ROC and AUC Score
The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with an overall relatively high Multi-class Area Under the Curve (AUC): .9015.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE, include=FALSE }
#16 Generate a ROC and AUC output, interpret the results
h_roc2 <- multiclass.roc(happy_data2$HIQ, as.numeric(happy_data2_fitted_t), plot = TRUE)
h_roc2$auc
```
### Recommendations
This model has very serious real-world implications. The most important variable of this decision tree was life ladder; when it was removed, the most important variable became healthy life expectancy at birth. It is important to note that optimizing only the variables described in this analysis, as opposed to taking a holistic approach, could harm a country's actual happiness while improving its score, through a mechanism similar to variable optimization under the U.S. News and World Report college rankings. One could argue that receiving a false classification in a lower quartile is less harmful than a false classification in a higher quartile, as the former could push a country (if it pays attention to the scores) to work harder to increase the happiness of its citizens.
## Conclusions
We've created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix tells us how each variable is correlated with every other quantitative factor and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder. Similarly, perceptions of corruption are negatively correlated with the happiness index, as expected.
In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.
The decision tree model further demonstrated that life ladder is the most important variable in determining a country's happiness, with higher life ladder scores conferring a higher happiness quartile. In conclusion, when focusing on improving a country's happiness, special care should be put into ensuring that people perceive their lives positively, as that seems to be the number one determinant.