---
title: "BDA - Project Work"
author: "Jacopo Losi, Nicola Saljoughi"
output:
pdf_document:
toc: yes
toc_depth: 3
word_document:
toc: yes
toc_depth: '1'
html_document:
df_print: paged
toc: yes
toc_depth: '1'
---
```{r setup, include=FALSE}
# This chunk just sets echo = TRUE as default (i.e. print all code)
knitr::opts_chunk$set(echo = TRUE, tidy = FALSE)
library(aaltobda)
library(arules)
library(bayesplot)
library(brms)
library(devtools)
library(dplyr)
library(easyGgplot2)
library(epitools)
library(ggplot2)
library(KernSmooth)
library(loo)
library(magrittr)
library(MASS)
library(mvtnorm)
library(nnet)
library(phonTools)
library(rcompanion)
library(rstan)
library(rstanarm)
library(shinystan)
library(tableone)
library(tinytex)
```
\clearpage
# Introduction
This project is based on a study carried out in 2015 by a group of researchers to estimate the incidence of serious suicide attempts in Shandong, China, and to examine the factors associated with fatality among the attempters. \newline
We have chosen to examine a dataset on suicides because it is an important but often overlooked problem in today's society. Not only does it reflect a larger problem in a country's societal system, but it can also be a burden on hospital resources. We believe that by talking about it more openly, and by genuinely trying to estimate its size and impact, we can start to understand where its causes are rooted and what can be done to fight it. \newline
We invite the reader to check the source section for further details on the setting and results of the cited paper. \newline
In this report we carry out our analysis following the Bayesian approach. Since the frequentist approach was also covered during the lectures, we thought it meaningful to compare the two at the beginning of the analysis.\newline
Adopting the Bayesian approach, we first develop a multiple logistic regression model using all the variables; we then perform variable selection to determine the most influential factors and develop a second multiple logistic regression model using only the selected variables. After that, we assess the convergence and efficiency of the models, perform posterior predictive checking and compare the models. To conclude, we carry out a prediction on the age of the attempters and answer our analysis problem.
## Analysis Problem
The objective of the project is to use the Bayesian approach to develop models that evaluate the most influential factors related to serious suicide attempts (SSAs, defined as suicide attempts resulting in either death or hospitalisation) and to make predictions on the age of the attempters.
## Data
Data from two independent health surveillance systems were linked, consisting of records of suicide deaths and hospitalisations that occurred among residents of selected counties during 2009-2011.
The data set consists of 2571 observations of 11 variables:
\begin{itemize}
\item \texttt{Person\_ID}: ID number, $1,...,2571$
\item \texttt{Hospitalised}: \textit{yes} or \textit{no}
\item \texttt{Died}: \textit{yes} or \textit{no}
\item \texttt{Urban}: \textit{yes}, \textit{no} or \textit{unknown}
\item \texttt{Year}: $2009$, $2010$ or $2011$
\item \texttt{Month}: $1,...,12$
\item \texttt{Sex}: \textit{female} or \textit{male}
\item \texttt{Age}: years
\item \texttt{Education}: \textit{iliterate}, \textit{primary}, \textit{Secondary}, \textit{Tertiary} or \textit{unknown}
\item \texttt{Occupation}: one of ten categories
\item \texttt{method}: one of nine methods
\end{itemize}
It is important to note that the population in the study is predominantly rural, and that a limitation of the study is that the incidence estimates are likely underestimated due to underreporting in both surveillance systems.
## Source
Sun J, Guo X, Zhang J, Wang M, Jia C, Xu A (2015) "Incidence and fatality of serious suicide attempts in a predominantly rural population in Shandong, China: a public health surveillance study," BMJ Open 5(2): e006762. https://doi.org/10.1136/bmjopen-2014-006762
Data downloaded via Dryad Digital Repository. https://doi.org/10.5061/dryad.r0v35
\clearpage
# Analysis
The analysis is structured as follows:
\begin{itemize}
\item \textbf{Bayesian vs Frequentist}: we compare the frequentist and the Bayesian approach by developing two multiple logistic regression models, assigning to the Bayesian model the default flat priors in Stan;
\item \textbf{Parameter selection}: we evaluate the most influential factors on the fatality of the SSAs and their correlation, in order to develop the reduced logistic regression model used in the following analysis;
\item \textbf{Full and reduced logistic regression models}: after the parameter selection, we carry out a first comparison between the full logistic regression model and the reduced one (which only considers the factors selected in the parameter selection phase);
\item \textbf{Convergence and efficiency analysis}: convergence and efficiency of the models are assessed using R-hat and ESS respectively, and Hamiltonian Monte Carlo (HMC) specific diagnostics (tree depth and divergences) are computed for both models;
\item \textbf{Model comparison}: different models, designed using different families of distributions, are compared using leave-one-out cross-validation, and the best model is selected for the remaining analysis;
\item \textbf{Sensitivity analysis}: different priors are tested on the best model selected in the previous phase, the convergence and efficiency of the resulting models are assessed, and the models are compared using LOO-CV;
\item \textbf{Predictive checking}: the best model selected in the sensitivity analysis is used to carry out posterior predictive checking;
\item \textbf{Age prediction}: to conclude, we use the most influential parameters to generate a prediction and answer our analysis problem.
\end{itemize}
## Data preprocessing
```{r}
datFile <- "suicide attempt data_2.csv"
datCsv <- read.csv(datFile, stringsAsFactors=FALSE)
datSet <- as.data.frame(datCsv)
datSet$Season <- datSet$Month
datSet$Month = NULL
## Remove unknown labels
indexUnkn_1 <- which(datSet$Education == 'unknown')
indexUnkn_2 <- which(datSet$Urban == 'unknown')
indexUnkn_3 <- which(datSet$Occupation == 'others/unknown')
datSet <- datSet[-c(indexUnkn_1, indexUnkn_2,indexUnkn_3),]
# Hospitalised
indexHosp <- which(datSet$Hospitalised == 'yes')
indexNoHosp <- which(datSet$Hospitalised == 'no')
datSet$Hospitalised[indexHosp] <- 1 # 1 --> yes
datSet$Hospitalised[indexNoHosp] <- 0 # 0 --> no
# Died
indexDied <- which(datSet$Died == 'yes')
indexNoDied <- which(datSet$Died == 'no')
datSet$Died[indexDied] <- 1 # 1 --> yes
datSet$Died[indexNoDied] <- 0 # 0 --> no
# Urban
indexUrban <- which(datSet$Urban == 'yes')
indexNoUrban <- which(datSet$Urban == 'no')
datSet$Urban[indexUrban] <- 1 # 1 --> yes
datSet$Urban[indexNoUrban] <- 0 # 0 --> no
#Year
indexYear2009 <- which(datSet$Year == 2009)
indexYear2010 <- which(datSet$Year == 2010)
indexYear2011 <- which(datSet$Year == 2011)
datSet$Year[indexYear2009] <- 1 # 1 --> 2009
datSet$Year[indexYear2010] <- 2 # 2 --> 2010
datSet$Year[indexYear2011] <- 3 # 3 --> 2011
# Sex
indexMale <- which(datSet$Sex == 'male')
indexFemale <- which(datSet$Sex == 'female')
datSet$Sex[indexMale] <- 1 # 1 --> male
datSet$Sex[indexFemale] <- 0 # 0 --> female
# Education
indexEduZero <- which(datSet$Education == 'iliterate')
indexEduOne <- which(datSet$Education == 'primary')
indexEduTwo <- which(datSet$Education == 'Secondary')
indexEduThree <- which(datSet$Education == 'Tertiary')
datSet$Education[indexEduZero] <- 0 # 0 --> iliterate
datSet$Education[indexEduOne] <- 1 # 1 --> primary
datSet$Education[indexEduTwo] <- 2 # 2 --> Secondary
datSet$Education[indexEduThree] <- 3 # 3 --> Tertiary
# Occupation
indexUnEmpl <- which(datSet$Occupation == 'unemployed')
indexFarm <- which(datSet$Occupation == 'farming')
indexProf <- which(datSet$Occupation == 'business/service' | datSet$Occupation == 'professional' | datSet$Occupation == 'worker')
datSet$Occupation[indexUnEmpl] <- 0 # 0 --> unemployed
datSet$Occupation[indexFarm] <- 1 # 1 --> farming
datSet$Occupation[indexProf] <- 2 # 2 --> professional and worker
datSet$Occupation[-c(indexUnEmpl, indexFarm, indexProf)] <- 3 # 3 --> others
# Method
indexPesticide <- which(datSet$method == 'Pesticide')
indexPoison <- which(datSet$method == 'Other poison')
indexHanging <- which(datSet$method == 'hanging')
indexOthers <- which(datSet$method != 'Pesticide' &
datSet$method != 'Other poison' &
datSet$method != 'hanging')
datSet$method[indexPesticide] <- 1 # 1 --> Pesticide
datSet$method[indexPoison] <- 2 # 2 --> Other poison
datSet$method[indexHanging] <- 3 # 3 --> hanging
datSet$method[indexOthers] <- 4 # 4 --> All others
# Season
indexSpring <- which(datSet$Season >= 3 & datSet$Season <= 5)
indexSummer <- which(datSet$Season >= 6 & datSet$Season <= 8)
indexAutumn <- which(datSet$Season >= 9 & datSet$Season <= 11)
indexWinter <- which(datSet$Season == 12 | datSet$Season <= 2)
datSet$Season[indexSpring] <- 1 # 1 --> Spring
datSet$Season[indexSummer] <- 2 # 2 --> Summer
datSet$Season[indexAutumn] <- 3 # 3 --> Autumn
datSet$Season[indexWinter] <- 4 # 4 --> Winter
datSetCluster <- datSet
# Age
indexAgeOne <- which(datSet$Age <= 34)
indexAgeTwo <- which(datSet$Age >= 35 & datSet$Age <= 49)
indexAgeThree <- which(datSet$Age >= 50 & datSet$Age <= 64)
indexAgeFour <- which(datSet$Age >= 65)
datSetCluster$Age[indexAgeOne] <- 1 # 1 --> <=34
datSetCluster$Age[indexAgeTwo] <- 2 # 2 --> 35-49
datSetCluster$Age[indexAgeThree] <- 3 # 3 --> 50-64
datSetCluster$Age[indexAgeFour] <- 4 # 4 --> >=65
```
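As a quick sanity check on the recoding above, we can tabulate the recoded columns (a small sketch; it assumes the column names used in the chunk):
```{r}
# Frequency tables of the recoded variables (sketch)
for (v in c("Died", "Urban", "Sex", "Education", "Occupation",
            "method", "Season", "Age")) {
  cat("\n", v, ":\n")
  print(table(datSetCluster[[v]]))
}
```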
## Bayesian vs. Frequentist
As mentioned above, we first compare the frequentist and the Bayesian approach, in order to analyse the dataset with both models, since both approaches were discussed during the lectures.
### Model description
In order to evaluate the factors that most influence the probability of a fatal SSA, an obvious choice is to develop a multiple logistic regression model.
### Prior choices
For the first Bayesian model we have assumed flat priors.
The two models follow below.
### Frequentist approach
```{r}
freqModel <- glm(as.numeric(Died) ~ as.numeric(Urban) +
as.numeric(Year) +
as.numeric(Season) +
as.numeric(Sex) +
as.numeric(Age) +
as.numeric(Education) +
as.numeric(Occupation) +
as.numeric(method),
data = datSetCluster,
family = binomial(link = "logit"))
summary(freqModel)
```
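Since the coefficients of a logistic regression are log-odds, exponentiating them yields odds ratios, which are often easier to interpret. A small sketch (\texttt{confint} profiles the likelihood, so it may take a moment):
```{r}
# Odds ratios with profile-likelihood confidence intervals (sketch)
round(exp(cbind(OR = coef(freqModel), confint(freqModel))), 3)
```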
### Bayesian approach using Stan
```{r}
## Create Stan data
datFullBayes <- list(N = nrow(datSetCluster),
p = ncol(datSetCluster) - 2,
died = as.numeric(datSetCluster$Died),
urban = as.numeric(datSetCluster$Urban),
year = as.numeric(datSetCluster$Year),
season = as.numeric(datSetCluster$Season),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education),
job = as.numeric(datSetCluster$Occupation),
method = as.numeric(datSetCluster$method))
## Load Stan file
fileName <- "./logistic_regression_model.stan"
stanCodeFull <- readChar(fileName, file.info(fileName)$size)
cat(stanCodeFull)
```
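The contents of `logistic_regression_model.stan` are printed by the chunk above when the report is knitted. For readers without the repository, the following is a plausible sketch of such a model, not the exact file; in particular, the ordering of the coefficients in `beta` is assumed from the plots later in the report (`beta[3]` = urban, `beta[5]` = sex, `beta[6]` = age, `beta[7]` = education, `beta[8]` = occupation).
```stan
// Hypothetical sketch of logistic_regression_model.stan (assumed, not the
// actual file): a multiple logistic regression with flat (improper) priors.
data {
  int<lower=0> N;                  // number of observations
  int<lower=0> p;                  // number of coefficients (intercept + 8)
  int<lower=0, upper=1> died[N];   // outcome: 1 = died, 0 = survived
  vector[N] urban;  vector[N] year;  vector[N] season;  vector[N] sex;
  vector[N] age;    vector[N] edu;   vector[N] job;     vector[N] method;
}
parameters {
  vector[p] beta;                  // regression coefficients, flat priors
}
model {
  died ~ bernoulli_logit(beta[1] + beta[2] * year + beta[3] * urban
                         + beta[4] * season + beta[5] * sex + beta[6] * age
                         + beta[7] * edu + beta[8] * job + beta[9] * method);
}
```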
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stanCodeFull,
data = datFullBayes,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
```
```{r}
traceplot(resStanFull, pars = c('beta[3]','beta[4]', 'beta[5]',
'beta[6]', 'beta[7]', 'beta[8]',
'beta[9]'), inc_warmup = TRUE)
```
### Comparison between frequentist and Bayesian approach
```{r}
## Bayesian
print(resStanFull, pars = c('beta'))
## Frequentist
tableone::ShowRegTable(freqModel, exp = FALSE)
```
After this first analysis, as mentioned, we decided to proceed with the Bayesian approach.
For the analysis that follows, we first developed two models to evaluate the influence of the factors on the fatality of the attempts:
\begin{itemize}
\item \textbf{Full logistic regression model} where all the parameters are included, as the one shown before;
\item \textbf{Reduced logistic regression model} that only includes the parameters selected in the variable selection phase.
\end{itemize}
At the end of the analysis we developed two further models to predict the age of the attempters. These correspond, once again, to a full model with all the parameters and a reduced one with only the most relevant parameters. \newline
## Parameter selection
### Data loading
First of all we load the data. Note that some processing was done on the original data, removing samples with missing entries (which turned out to constitute less than 6 % of the dataset) and turning labels from strings into integers.
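The dropped fraction can be verified directly (a rough sketch, using the objects created in the preprocessing chunk; \texttt{datSetCluster} may still contain a few NA rows at this point):
```{r}
# Fraction of raw records removed during preprocessing (rough sketch)
1 - nrow(datSetCluster) / nrow(datCsv)
```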
```{r}
## Create Stan data
datFull <- list(N = nrow(datSetCluster),
p = ncol(datSetCluster) - 2,
died = as.numeric(datSetCluster$Died),
urban = as.numeric(datSetCluster$Urban),
year = as.numeric(datSetCluster$Year),
season = as.numeric(datSetCluster$Season),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education),
job = as.numeric(datSetCluster$Occupation),
method = as.numeric(datSetCluster$method))
```
In this phase we are testing different models, so it is worth working with only a random subsample of the data: the dataset is large, and computation on the whole of it would take a long time.
We therefore proceed as follows:
* we generate a vector of 50 random indices into our dataset;
* we test the models on this subsample, which is sufficient to avoid losing generality;
* we run the final model on the whole dataset.
```{r}
# Sample 50 Person_ID values with replacement, so duplicates may occur.
random_index <- sample(datSetCluster$Person_ID, size = 50, replace = TRUE)
# The IDs are used as row indices; IDs belonging to rows removed during
# preprocessing yield NA rows, which na.omit() then drops.
data_reduced <- datSetCluster[random_index, ]
data_reduced <- na.omit(data_reduced)
```
```{r}
## Create Stan data
dat_red <- list(N = nrow(data_reduced),
p = ncol(data_reduced) - 2,
died = as.numeric(data_reduced$Died),
urban = as.numeric(data_reduced$Urban),
year = as.numeric(data_reduced$Year),
season = as.numeric(data_reduced$Season),
sex = as.numeric(data_reduced$Sex),
age = as.numeric(data_reduced$Age),
edu = as.numeric(data_reduced$Education),
job = as.numeric(data_reduced$Occupation),
method = as.numeric(data_reduced$method))
```
### Full logistic regression model
Here we start by implementing the full logistic regression model.
```{r}
## FULL LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOne <- "./logistic_regression_model.stan"
stan_code_full <- readChar(fileNameOne, file.info(fileNameOne)$size)
cat(stan_code_full)
```
### Stan Code Running
The Stan models are run using five chains of 2000 iterations, a warmup of 800 iterations and a thinning factor of 10.
`thin` is a positive integer that specifies the period for saving samples; it defaults to 1 and is normally left at the default. In our case, though, the posterior draws take up a lot of memory even with the reduced dataset, and we require a large number of iterations to achieve a sufficient effective sample size, so in this phase we set it to 10. With these settings each chain keeps (2000 - 800) / 10 = 120 post-warmup draws, i.e. 600 draws over the five chains.
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stan_code_full,
data = dat_red,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanFull, pars = c('beta'))
```
### Variable selection
In this section we evaluate the most influential factors and their correlation, in order to select the most descriptive ones to be used to construct our second model (the reduced logistic regression model).\newline
First of all we process our data:
```{r}
# Collect the posterior draws of the coefficients into a matrix for the plots
beta_matrix <- zeros(length(extract(resStanFull)$beta[,1]), ncol(data_reduced) - 2)
for (i in 1:(ncol(data_reduced) - 2))   # one column per coefficient
  beta_matrix[,i] <- extract(resStanFull)$beta[,i]
beta_df <- as.data.frame(beta_matrix)
```
Now we generate scatter plots in order to evaluate the correlations between the parameters:
```{r}
# Generate some scatter plots in order to see the correlations between parameters
scatter_1 <- ggplot(beta_df, aes(x=V3, y=V7)) +
ggtitle("Correlation between location and education") +
xlab("Urban") + ylab("Education") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
scatter_2 <- ggplot(beta_df, aes(x=V3, y=V8)) +
ggtitle("Correlation between location and occuption") +
xlab("Urban") + ylab("Occupation") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
scatter_3 <- ggplot(beta_df, aes(x=V5, y=V6)) +
ggtitle("Correlation between gender and age") +
xlab("Gender") + ylab("Age") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
ggplot2.multiplot(scatter_1,scatter_2,scatter_3, cols=1)
```
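The visual impression can be complemented numerically: the posterior correlation matrix of the coefficient draws summarises all pairwise relations at once (a small sketch; the column numbering of `beta_df` follows the coefficient ordering assumed above).
```{r}
# Posterior correlations between the coefficient draws (sketch)
round(cor(beta_df), 2)
```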
Now we overlay the histogram, density and mean value of each parameter; the most interesting plots are presented. Using the mean value is informative, since it helps identify which parameters influence the posterior the most.
Thus, looking at the weights of the parameters in the histograms, it is possible to identify with reasonable precision which parameters are the most informative.
```{r}
# Helper: overlay histogram, kernel density and posterior mean for one coefficient
plot_beta <- function(fit, idx, title) {
  draws <- extract(fit)$beta[, idx]
  qplot(draws, geom = 'blank',
        xlab = 'Value of weight', ylab = 'Density', main = title) +
    geom_histogram(aes(y = ..density..), col = I('red'), bins = 50) +
    geom_line(aes(y = ..density..), size = 1, col = I('blue'), stat = 'density') +
    geom_vline(aes(xintercept = mean(draws)), col = I('yellow'),
               linetype = "dashed", size = 1)
}
plot_1 <- plot_beta(resStanFull, 3, 'Urban')
plot_2 <- plot_beta(resStanFull, 5, 'Sex')
plot_3 <- plot_beta(resStanFull, 6, 'Age')
plot_4 <- plot_beta(resStanFull, 7, 'Education')
plot_5 <- plot_beta(resStanFull, 8, 'Occupation')
ggplot2.multiplot(plot_1,plot_2,plot_3,plot_4, plot_5, cols=3)
```
From the analysis above, and especially from the histograms, it is clear that the parameters that matter most in our analysis are: whether people come from urban or rural areas, their education, their occupation, and partially whether they are male or female. As a matter of fact, the means and maximum values of the coefficients related to those parameters have the largest magnitudes, which means those parameters carry the most weight in the regression function of the model.
Therefore, for the further analysis it makes sense to work with only these parameters, in order to obtain a more focused evaluation based on the most relevant factors.
### Full logistic regression model
For completeness we report here the full model once again.
```{r}
## FULL LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOne <- "./logistic_regression_model.stan"
stan_code_full <- readChar(fileNameOne, file.info(fileNameOne)$size)
cat(stan_code_full)
```
### Reduced logistic regression model
We can now implement the reduced logistic regression model using the selected parameters.
```{r}
## REDUCED LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOneDef <- "./logistic_regression_model_def.stan"
stan_code_simple_def <- readChar(fileNameOneDef, file.info(fileNameOneDef)$size)
cat(stan_code_simple_def)
```
### Stan Code Running
Now we are going to run the model on the full dataset. \newline
First we define the Stan data to run the second (reduced) model.
```{r}
## Create Stan data
dat_def <- list(N = nrow(datSetCluster),
p = 4,
died = as.numeric(datSetCluster$Died),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education))
```
We first run the full model. The settings are the same as before, except that now we use the full dataset and the default value for thin.
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stan_code_full,
data = datFull,
chains = 5,
iter = 2000,
warmup = 800,
thin = 1,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanFull, pars = c('beta'))
```
Now we run the reduced model.
```{r echo=FALSE}
## REDUCED LOGISTIC REGRESSION MODEL
# Run Stan
resStanRed <- stan(model_code = stan_code_simple_def,
data = dat_def,
chains = 5,
iter = 2000,
warmup = 800,
thin = 1,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanRed, pars = c('beta'))
```
## Convergence and efficiency analysis
In this section we analyse the implemented models, both in terms of convergence (assessed using R-hat and the HMC-specific convergence diagnostics) and of efficiency (by computing the effective sample size). \newline
### R-hat
The R-hat convergence diagnostic compares between- and within-chain estimates for model parameters and other univariate quantities of interest. If the chains have not mixed well, R-hat is larger than 1. In practical terms, it is good practice to use at least four chains and to rely on the sample only if R-hat is less than 1.05. \newline
From the output of \texttt{print(fit)} displayed above, we can see that all the R-hat values are equal to one for both models, and therefore we have convergence.
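The same check can be made explicit programmatically (a small sketch using \texttt{rstan}'s \texttt{summary} method, whose summary matrix contains an \texttt{Rhat} column):
```{r}
# Largest R-hat over all quantities of both fits (sketch)
max(summary(resStanFull)$summary[, "Rhat"])
max(summary(resStanRed)$summary[, "Rhat"])
```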
### HMC
Here we compute convergence diagnostic specific to Hamiltonian Monte Carlo, and in particular divergences and tree depth.\newline
The following code computes the diagnostic for the full model:
```{r, fig.width=8, fig.height=5, warning=FALSE}
## Full model HMC diagnostic
check_hmc_diagnostics(resStanFull)
```
As we can see, none of the iterations ended with a divergence, nor did any saturate the maximum tree depth. \newline
Now we compute the diagnostic for the reduced model:
```{r, fig.width=8, fig.height=5, warning=FALSE}
## Reduced model HMC diagnostic
check_hmc_diagnostics(resStanRed)
```
Also for the reduced model none of the iterations ended with a divergence nor saturated the maximum tree depth.
### ESS
The effective sample size (ESS) measures the amount by which autocorrelation within the chains increases the uncertainty in the estimates. \newline
As with the R-hat values, we can directly observe the effective sample sizes of the chains in the output of \texttt{print(fit)} shown earlier. The values are all sufficiently high for both models.
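Again, this can be verified explicitly from the \texttt{n\_eff} column of the fit summary (sketch):
```{r}
# Smallest effective sample size over all quantities of both fits (sketch)
min(summary(resStanFull)$summary[, "n_eff"])
min(summary(resStanRed)$summary[, "n_eff"])
```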
## Model comparison
In order to carry out precise posterior predictive checking and model comparison, we decided to use the \texttt{rstanarm} built-in function \texttt{stan\_glm}.
This was done mainly for tractability: with this function it is easier to create different Stan models, add priors and generate new samples from the posterior. \newline
As can be seen from the R code that follows, the model comparison was developed by designing models with different distribution families.
```{r echo=FALSE, results = 'hide'}
## Different Stan models, testing different distribution families and priors
datSetCluster <- na.omit(datSetCluster)
# Define null and full model
model.null <- stan_glm(as.numeric(Died) ~ 1,
data = datSetCluster,
family = binomial(link = 'logit'))
model.full <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = normal(0,10),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.reduced = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method),
data = datSetCluster,
family = binomial(link = 'logit'),
prior = normal(0,10),
prior_intercept = NULL,
chains = 5,
iter = 2000,
warmup = 1000,
QR = TRUE,
adapt_delta = 0.99)
model.normal = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = gaussian,
prior = normal(0, 10),
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000,
adapt_delta = 0.99
)
model.poisson = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = poisson(link = "log"),
prior = normal(0,5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000,
adapt_delta = 0.99
)
```
```{r}
cat('\n Summary of null model \n')
summary(model.null)
cat('\n Summary of full model \n')
summary(model.full)
cat('\n Summary of reduced model \n')
summary(model.reduced)
cat('\n Summary of normal model \n')
summary(model.normal)
cat('\n Summary of Poisson model \n')
summary(model.poisson)
```
In this section the implemented models are compared using leave-one-out cross-validation.
```{r}
loo.null <- loo(model.null, cores = 4)
loo.full <- loo(model.full, cores = 4)
loo.reduced <- loo(model.reduced, cores = 4)
loo.normal <- loo(model.normal, cores = 4)
loo.poisson <- loo(model.poisson, cores = 4)
compar_models <- loo_compare(loo.null, loo.full, loo.reduced, loo.normal, loo.poisson)
compar_models
```
As the analysis above shows, the model that performs best is the one with all the parameters and the binomial distribution family.
This reflects our initial expectations: since the outcome being predicted is binary, using the binomial family for the model fitting gives the best results.
Moreover, at the beginning we thought that the model might suffer from overfitting when using all the parameters.
However, the analysis showed that using all the parameters gives the best fit.
## Sensitivity analysis
At this point it is useful to carry out a sensitivity analysis over the prior, using the full model with the binomial likelihood (i.e. the one that performed best), in order to understand which priors improve the posterior distribution without shifting it away from its mean.\newline
The prior choices that were made are described below:
\begin{itemize}
\item \textbf{Uniform Prior}
\item \textbf{Normal}: $N(0,5)$
\item \textbf{Student}: $Student_t(1,0,2.5)$
\item \textbf{Cauchy}: $Cauchy(0,4)$
\end{itemize}
```{r echo = FALSE, results = 'hide'}
# Different weakly informative prior choices with the full model
datSetCluster <- na.omit(datSetCluster)
model.full.uniform <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = NULL,
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.normal <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = normal(0,5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.student <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = student_t(1,0,2.5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.cauchy <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = cauchy(0,4),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
```
```{r warning=FALSE}
pars_sel <- c('as.numeric(Urban)', 'as.numeric(Sex)', 'as.numeric(Age)',
              'as.numeric(Education)', 'as.numeric(Occupation)')
# Density plots
stan_dens(model.full.uniform, pars = pars_sel) + ggtitle('Plots for uniform prior')
stan_dens(model.full.normal, pars = pars_sel) + ggtitle('Plots for normal prior')
stan_dens(model.full.student, pars = pars_sel) + ggtitle('Plots for student prior')
stan_dens(model.full.cauchy, pars = pars_sel) + ggtitle('Plots for cauchy prior')
# Trace plots
stan_trace(model.full.uniform, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for uniform prior')
stan_trace(model.full.normal, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for normal prior')
stan_trace(model.full.student, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for student prior')
stan_trace(model.full.cauchy, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for cauchy prior')
```
Observing the distributions above, it is clear that even when the prior changes, the posterior distribution remains almost the same: there are only small differences among the analysed parameters across all the models.
This can be seen both in the plots of the distributions and in the trace plots of the chains.
Moreover, all the models behave well, achieving convergence in every case.
Below, the plots for R-hat and ESS are shown.
As said above, convergence is achieved: the R-hat values are all below 1.05, the usual threshold for convergence.
The ESS ratios are close to 1 as well.
```{r}
# Rhat histogram plots
rhat.uniform <- stan_rhat(model.full.uniform, bins = 50)
rhat.normal <- stan_rhat(model.full.normal, bins = 50)
rhat.student <- stan_rhat(model.full.student, bins = 50)
rhat.cauchy <- stan_rhat(model.full.cauchy, bins = 50)
uniform <- data.frame(rhat = rhat.uniform$data$stat)
normal <- data.frame(rhat = rhat.normal$data$stat)
student <- data.frame(rhat = rhat.student$data$stat)
cauchy <- data.frame(rhat = rhat.cauchy$data$stat)
uniform$distr <- 'uniform'
normal$distr <- 'normal'
student$distr <- 'student'
cauchy$distr <- 'cauchy'
distrLength <- rbind(uniform, normal, student, cauchy)
ggplot(distrLength, aes(rhat, fill = distr)) + geom_histogram(alpha = 0.5, aes(y = ..density..), bins = 50)+ scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2")
```
```{r}
# Ess histogram plots
ess.uniform <- stan_ess(model.full.uniform, bins = 50)
ess.normal <- stan_ess(model.full.normal, bins = 50)
ess.student <- stan_ess(model.full.student, bins = 50)
ess.cauchy <- stan_ess(model.full.cauchy, bins = 50)
uniform <- data.frame(ratio_ess = ess.uniform$data$stat)
normal <- data.frame(ratio_ess = ess.normal$data$stat)
student <- data.frame(ratio_ess = ess.student$data$stat)
cauchy <- data.frame(ratio_ess = ess.cauchy$data$stat)
uniform$distr <- 'uniform'
normal$distr <- 'normal'
student$distr <- 'student'
cauchy$distr <- 'cauchy'
distrLength <- rbind(uniform, normal, student, cauchy)
ggplot(distrLength, aes(ratio_ess, fill = distr)) + geom_histogram(alpha = 0.5, aes(y = ..density..), bins = 50)+ scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2")
```
```{r}
# Leave-one-out cross-validation over the models with different priors
loo.uniform <- loo(model.full.uniform, cores = 4)
loo.normal <- loo(model.full.normal, cores = 4)
loo.student <- loo(model.full.student, cores = 4)
loo.cauchy <- loo(model.full.cauchy, cores = 4)
compar_models_prior <- loo_compare(loo.uniform, loo.normal, loo.student, loo.cauchy)
compar_models_prior
```
Observing the results of the LOO-CV comparison, the models have almost the same performance; the elpd values differ only slightly.
That said, the model with the best performance is the one that uses a Cauchy distribution as prior. \newline
Therefore, that model was used to develop the predictive checking.
## Predictive checking
Observing the values of the relative elpd among the models tested in the previous section, the one that performs best is the one with a Cauchy prior.
```{r, eval=FALSE, echo=TRUE, include=TRUE}
y <- as.numeric(datSetCluster$Died)
y_tilde <- posterior_predict(model.full.cauchy, draws = 500)
color_scheme_set("brightblue")
ppc_dens_overlay(y, y_tilde[1:50, ])
```
Observing the plot resulting from the predictive check, the generated values reflect the observed ones.
We can therefore conclude that the model fits adequately.
## Age prediction
To conclude, we want to predict the most probable age of the attempters. A full regression model over age is implemented with a normal prior on the standard deviation, $\sigma \sim N(0,10)$; its convergence and efficiency are verified, and the posterior prediction is generated using a prior over the mean, $\mu_p \sim N(60, 10)$.
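The model itself lives in `stan_model_prior_all_params.stan` and is printed by the chunk below when knitting. For readers without the repository, a plausible sketch of such a model follows; the exact file may differ, and in particular the way the $\mu_p \sim N(60,10)$ prior enters the generated quantities is our assumption.
```stan
// Hypothetical sketch of stan_model_prior_all_params.stan (assumed, not the
// actual file): Gaussian regression of age on all remaining predictors.
data {
  int<lower=0> N;
  int<lower=0> p;                 // number of coefficients (here 10)
  vector[N] age;
  vector[N] died;  vector[N] hosp;  vector[N] year;    vector[N] sex;
  vector[N] job;   vector[N] urban; vector[N] edu;     vector[N] method;
  vector[N] season;
}
parameters {
  vector[p] beta;
  real<lower=0> sigma;
}
model {
  sigma ~ normal(0, 10);          // prior on the noise scale
  age ~ normal(beta[1] + beta[2] * died + beta[3] * hosp + beta[4] * year
               + beta[5] * sex + beta[6] * job + beta[7] * urban
               + beta[8] * edu + beta[9] * method + beta[10] * season, sigma);
}
generated quantities {
  // posterior prediction of the age; the N(60, 10) prior over the mean is
  // an assumption about how the original file constructs predict_age
  real mu_p = normal_rng(60, 10);
  real predict_age = normal_rng(mu_p, sigma);
}
```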
```{r warning=FALSE}
data_4_fit_complete <- list(N = nrow(datSet),
p = 10,
age = as.numeric(datSet$Age),
died = as.numeric(datSet$Died),
hosp = as.numeric(datSet$Hospitalised),
year = as.numeric(datSet$Year),
sex = as.numeric(datSet$Sex),
job = as.numeric(datSet$Occupation),
urban = as.numeric(datSet$Urban),
edu = as.numeric(datSet$Education),
method = as.numeric(datSet$method),
season = as.numeric(datSet$Season))
fileName <- "./stan_model_prior_all_params.stan"
stan_code_complete <- readChar(fileName, file.info(fileName)$size)
cat(stan_code_complete)
# Run Stan
fitStan_complete <- stan(model_code = stan_code_complete,
data = data_4_fit_complete,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(fitStan_complete, pars = c('beta', 'sigma','predict_age'))
```
```{r}
posterior <- as.array(fitStan_complete)
color_scheme_set("brightblue")
mcmc_dens_overlay(posterior, pars = c("beta[1]", "predict_age")) + ggtitle("Posterior draws and posterior predictive")
```
\clearpage
# Conclusions
The most influential factors related to SSAs are occupation, whether the individual lives in a rural or an urban environment, age and education level. However, we also noticed that the remaining factors are important for building a descriptive model of the phenomenon, since the full model proved consistently better than the reduced one. \newline
The age prediction confirmed the results of the analysis carried out by the authors of the paper we referred to in our project, giving an average value of 65 years of age.
## Problems encountered
Given the complexity of the structure of the problem, a lot of work was required to understand which model would suit our analysis best and how the data needed to be processed. Furthermore, it proved genuinely difficult to carry out a meaningful analysis and to interpret the results. This effort, however, gave us a great deal of experience with this kind of models and a much better intuition for the topics we had seen during the lectures and applied in the assignments.
## Potential improvements
With the data at our disposal, all relevant parts of the analysis were based on a standard multiple logistic regression model. In terms of Bayesian reasoning, the analysis could be improved by developing a more complex model, such as one with a hierarchical structure, and by a longer exploration of prior distributions (and of hyperpriors).