---
title: "BDA - Project Work"
author: "Jacopo Losi, Nicola Saljoughi"
output:
pdf_document:
toc: yes
toc_depth: 3
word_document:
toc: yes
toc_depth: '1'
html_document:
df_print: paged
toc: yes
toc_depth: '1'
---
```{r setup, include=FALSE}
# This chunk just sets echo = TRUE as default (i.e. print all code)
knitr::opts_chunk$set(echo = TRUE, tidy = FALSE)
library(aaltobda)
library(arules)
library(bayesplot)
library(brms)
library(devtools)
library(dplyr)
library(easyGgplot2)
library(epitools)
library(ggplot2)
library(KernSmooth)
library(loo)
library(magrittr)
library(MASS)
library(mvtnorm)
library(nnet)
library(phonTools)
library(rcompanion)
library(rstan)
library(rstanarm)
library(shinystan)
library(tableone)
library(tinytex)
```
\clearpage
# Introduction
This project is based on a study carried out in 2015 by a group of researchers to estimate the incidence of serious suicide attempts in Shandong, China, and to examine the factors associated with fatality among the attempters. \newline
We have chosen to examine a dataset on suicides because it is an important but often overlooked problem in today's society. Not only does it reflect a larger problem in a country's societal system, but it can also be a burden on hospital resources. We believe that by talking about it more openly, and by genuinely trying to estimate its size and impact, we can start to understand where its causes are rooted and what can be done to fight it. \newline
We invite the reader to check the source section for further details on the setting and results of the cited paper. \newline
In this report we carry out our analysis following the Bayesian approach. Since the frequentist approach was also covered during the lectures, we thought it meaningful to compare the two at the beginning of the analysis.\newline
Adopting the Bayesian approach, we first develop a multiple logistic regression model using all the variables; we then perform variable selection to determine the most influential factors and develop a second multiple logistic regression model using only the selected variables. After that, we assess the convergence and efficiency of the models, perform posterior predictive checking and compare the models. To conclude, we carry out a prediction on the age of the attempters and answer our analysis problem.
## Analysis Problem
The objective of the project is to use the Bayesian approach to develop models that evaluate the most influential factors related to serious suicide attempts (SSAs, defined as suicide attempts resulting in either death or hospitalisation) and to make predictions on the age of the attempters.
## Data
Data from two independent health surveillance systems were linked, consisting of records of suicide deaths and hospitalisations that occurred among residents of selected counties during 2009-2011.
The data set consists of 2571 observations of 11 variables:
\begin{itemize}
\item \texttt{Person\_ID}: ID number, $1,...,2571$
\item \texttt{Hospitalised}: \textit{yes} or \textit{no}
\item \texttt{Died}: \textit{yes} or \textit{no}
\item \texttt{Urban}: \textit{yes}, \textit{no} or \textit{unknown}
\item \texttt{Year}: $2009$, $2010$ or $2011$
\item \texttt{Month}: $1,...,12$
\item \texttt{Sex}: \textit{female} or \textit{male}
\item \texttt{Age}: years
\item \texttt{Education}: \textit{iliterate}, \textit{primary}, \textit{Secondary}, \textit{Tertiary} or \textit{unknown}
\item \texttt{Occupation}: one of ten categories
\item \texttt{method}: one of nine methods
\end{itemize}
It is important to note that the population in the study is predominantly rural, and that a limitation of the study is that the incidence estimates are likely underestimated due to underreporting in both surveillance systems.
## Source
Sun J, Guo X, Zhang J, Wang M, Jia C, Xu A (2015) "Incidence and fatality of serious suicide attempts in a predominantly rural population in Shandong, China: a public health surveillance study," BMJ Open 5(2): e006762. https://doi.org/10.1136/bmjopen-2014-006762
Data downloaded via Dryad Digital Repository. https://doi.org/10.5061/dryad.r0v35
\clearpage
# Analysis
The analysis is structured as follows:
\begin{itemize}
\item \textbf{Bayesian vs Frequentist}: we compare the frequentist and the Bayesian approach by developing two multiple logistic regression models, assigning to the Bayesian model the default flat priors in Stan;
\item \textbf{Parameter selection}: we evaluate the most influential factors on the fatality of the SSAs and their correlation, in order to develop the reduced logistic regression model used in the following analysis;
\item \textbf{Full and reduced logistic regression models}: after the parameter selection, we carry out a first comparison between the full logistic regression model and the reduced one (which only considers the factors selected in the parameter selection phase);
\item \textbf{Convergence and efficiency analysis}: convergence and efficiency of the models are assessed using R-hat and ESS respectively, and Hamiltonian Monte Carlo (HMC) specific diagnostics (tree depth and divergences) are computed for both models;
\item \textbf{Model comparison}: different models, designed using different families of distributions, are compared using leave-one-out cross-validation, and the best model is selected for the remaining analysis;
\item \textbf{Sensitivity analysis}: different priors are tested on the best model selected in the previous phase, the convergence and efficiency of the resulting models are assessed, and the models are compared using LOO-CV;
\item \textbf{Predictive checking}: the best model selected in the sensitivity analysis is used to carry out posterior predictive checking;
\item \textbf{Age prediction}: to conclude, we use the most influential parameters to generate a prediction and answer our analysis problem.
\end{itemize}
## Data preprocessing
```{r}
datFile <- "suicide attempt data_2.csv"
datCsv <- read.csv(datFile, stringsAsFactors=FALSE)
datSet <- as.data.frame(datCsv)
datSet$Season <- datSet$Month
datSet$Month = NULL
## Remove unknown labels
indexUnkn_1 <- which(datSet$Education == 'unknown')
indexUnkn_2 <- which(datSet$Urban == 'unknown')
indexUnkn_3 <- which(datSet$Occupation == 'others/unknown')
datSet <- datSet[-c(indexUnkn_1, indexUnkn_2,indexUnkn_3),]
# Hospitalised
indexHosp <- which(datSet$Hospitalised == 'yes')
indexNoHosp <- which(datSet$Hospitalised == 'no')
datSet$Hospitalised[indexHosp] <- 1 # 1 --> yes
datSet$Hospitalised[indexNoHosp] <- 0 # 0 --> no
# Died
indexDied <- which(datSet$Died == 'yes')
indexNoDied <- which(datSet$Died == 'no')
datSet$Died[indexDied] <- 1 # 1 --> yes
datSet$Died[indexNoDied] <- 0 # 0 --> no
# Urban
indexUrban <- which(datSet$Urban == 'yes')
indexNoUrban <- which(datSet$Urban == 'no')
datSet$Urban[indexUrban] <- 1 # 1 --> yes
datSet$Urban[indexNoUrban] <- 0 # 0 --> no
#Year
indexYear2009 <- which(datSet$Year == 2009)
indexYear2010 <- which(datSet$Year == 2010)
indexYear2011 <- which(datSet$Year == 2011)
datSet$Year[indexYear2009] <- 1 # 1 --> 2009
datSet$Year[indexYear2010] <- 2 # 2 --> 2010
datSet$Year[indexYear2011] <- 3 # 3 --> 2011
# Sex
indexMale <- which(datSet$Sex == 'male')
indexFemale <- which(datSet$Sex == 'female')
datSet$Sex[indexMale] <- 1 # 1 --> male
datSet$Sex[indexFemale] <- 0 # 0 --> female
# Education
indexEduZero <- which(datSet$Education == 'iliterate')
indexEduOne <- which(datSet$Education == 'primary')
indexEduTwo <- which(datSet$Education == 'Secondary')
indexEduThree <- which(datSet$Education == 'Tertiary')
datSet$Education[indexEduZero] <- 0 # 0 --> iliterate
datSet$Education[indexEduOne] <- 1 # 1 --> primary
datSet$Education[indexEduTwo] <- 2 # 2 --> Secondary
datSet$Education[indexEduThree] <- 3 # 3 --> Tertiary
# Occupation
indexUnEmpl <- which(datSet$Occupation == 'unemployed')
indexFarm <- which(datSet$Occupation == 'farming')
indexProf <- which(datSet$Occupation == 'business/service' | datSet$Occupation == 'professional' | datSet$Occupation == 'worker')
datSet$Occupation[indexUnEmpl] <- 0 # 0 --> unemployed
datSet$Occupation[indexFarm] <- 1 # 1 --> farming
datSet$Occupation[indexProf] <- 2 # 2 --> professional and worker
datSet$Occupation[-c(indexUnEmpl, indexFarm, indexProf)] <- 3 # 3 --> others
# Method
indexPesticide <- which(datSet$method == 'Pesticide')
indexPoison <- which(datSet$method == 'Other poison')
indexHanging <- which(datSet$method == 'hanging')
indexOthers <- which(datSet$method != 'Pesticide' &
datSet$method != 'Other poison' &
datSet$method != 'hanging')
datSet$method[indexPesticide] <- 1 # 1 --> Pesticide
datSet$method[indexPoison] <- 2 # 2 --> Other poison
datSet$method[indexHanging] <- 3 # 3 --> hanging
datSet$method[indexOthers] <- 4 # 4 --> All others
# Season
indexSpring <- which(datSet$Season >= 3 & datSet$Season <= 5)
indexSummer <- which(datSet$Season >= 6 & datSet$Season <= 8)
indexAutumn <- which(datSet$Season >= 9 & datSet$Season <= 11)
indexWinter <- which(datSet$Season == 12 | datSet$Season <= 2)
datSet$Season[indexSpring] <- 1 # 1 --> Spring
datSet$Season[indexSummer] <- 2 # 2 --> Summer
datSet$Season[indexAutumn] <- 3 # 3 --> Autumn
datSet$Season[indexWinter] <- 4 # 4 --> Winter
datSetCluster <- datSet
# Age
indexAgeOne <- which(datSet$Age <= 34)
indexAgeTwo <- which(datSet$Age >= 35 & datSet$Age <= 49)
indexAgeThree <- which(datSet$Age >= 50 & datSet$Age <= 64)
indexAgeFour <- which(datSet$Age >= 65)
datSetCluster$Age[indexAgeOne] <- 1 # 1 --> <=34
datSetCluster$Age[indexAgeTwo] <- 2 # 2 --> 35-49
datSetCluster$Age[indexAgeThree] <- 3 # 3 --> 50-64
datSetCluster$Age[indexAgeFour] <- 4 # 4 --> >=65
```
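As a quick sanity check on the recoding above, we can tabulate the recoded columns (a small sketch; it assumes the column names used in the chunk):
```{r}
# Frequency tables of the recoded variables (sketch)
for (v in c("Died", "Urban", "Sex", "Education", "Occupation",
            "method", "Season", "Age")) {
  cat("\n", v, ":\n")
  print(table(datSetCluster[[v]]))
}
```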
## Bayesian vs. Frequentist
As mentioned above, we first compare the frequentist and the Bayesian approach, in order to analyse the dataset with both models, since both approaches were discussed during the lectures.
### Model description
In order to evaluate the factors that most influence the probability of a fatal SSA, an obvious choice is to develop a multiple logistic regression model.
### Prior choices
For the first Bayesian model we have assumed flat priors.
The two models follow below.
### Frequentist approach
```{r}
freqModel <- glm(as.numeric(Died) ~ as.numeric(Urban) +
as.numeric(Year) +
as.numeric(Season) +
as.numeric(Sex) +
as.numeric(Age) +
as.numeric(Education) +
as.numeric(Occupation) +
as.numeric(method),
data = datSetCluster,
family = binomial(link = "logit"))
summary(freqModel)
```
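Since the coefficients of a logistic regression are log-odds, exponentiating them yields odds ratios, which are often easier to interpret. A small sketch (\texttt{confint} profiles the likelihood, so it may take a moment):
```{r}
# Odds ratios with profile-likelihood confidence intervals (sketch)
round(exp(cbind(OR = coef(freqModel), confint(freqModel))), 3)
```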
### Bayesian approach using Stan
```{r}
## Create Stan data
datFullBayes <- list(N = nrow(datSetCluster),
p = ncol(datSetCluster) - 2,
died = as.numeric(datSetCluster$Died),
urban = as.numeric(datSetCluster$Urban),
year = as.numeric(datSetCluster$Year),
season = as.numeric(datSetCluster$Season),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education),
job = as.numeric(datSetCluster$Occupation),
method = as.numeric(datSetCluster$method))
## Load Stan file
fileName <- "./logistic_regression_model.stan"
stanCodeFull <- readChar(fileName, file.info(fileName)$size)
cat(stanCodeFull)
```
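The contents of `logistic_regression_model.stan` are printed by the chunk above when the report is knitted. For readers without the repository, the following is a plausible sketch of such a model, not the exact file; in particular, the ordering of the coefficients in `beta` is assumed from the plots later in the report (`beta[3]` = urban, `beta[5]` = sex, `beta[6]` = age, `beta[7]` = education, `beta[8]` = occupation).
```stan
// Hypothetical sketch of logistic_regression_model.stan (assumed, not the
// actual file): a multiple logistic regression with flat (improper) priors.
data {
  int<lower=0> N;                  // number of observations
  int<lower=0> p;                  // number of coefficients (intercept + 8)
  int<lower=0, upper=1> died[N];   // outcome: 1 = died, 0 = survived
  vector[N] urban;  vector[N] year;  vector[N] season;  vector[N] sex;
  vector[N] age;    vector[N] edu;   vector[N] job;     vector[N] method;
}
parameters {
  vector[p] beta;                  // regression coefficients, flat priors
}
model {
  died ~ bernoulli_logit(beta[1] + beta[2] * year + beta[3] * urban
                         + beta[4] * season + beta[5] * sex + beta[6] * age
                         + beta[7] * edu + beta[8] * job + beta[9] * method);
}
```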
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stanCodeFull,
data = datFullBayes,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
```
```{r}
traceplot(resStanFull, pars = c('beta[3]','beta[4]', 'beta[5]',
'beta[6]', 'beta[7]', 'beta[8]',
'beta[9]'), inc_warmup = TRUE)
```
### Comparison between frequentist and Bayesian approach
```{r}
## Bayesian
print(resStanFull, pars = c('beta'))
## Frequentist
tableone::ShowRegTable(freqModel, exp = FALSE)
```
After this first analysis, as mentioned, we decided to proceed with the Bayesian approach.
For the analysis that follows, we first developed two models to evaluate the influence of the factors on the fatality of the attempts:
\begin{itemize}
\item \textbf{Full logistic regression model} where all the parameters are included, as the one shown before;
\item \textbf{Reduced logistic regression model} that only includes the parameters selected in the variable selection phase.
\end{itemize}
At the end of the analysis we developed two further models to predict the age of the attempters. These correspond, once again, to a full model with all the parameters and a reduced one with only the most relevant parameters. \newline
## Parameter selection
### Data loading
First of all we load the data. Note that some processing was done on the original data, removing samples with missing entries (which turned out to constitute less than 6 % of the dataset) and turning labels from strings into integers.
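The dropped fraction can be verified directly (a rough sketch, using the objects created in the preprocessing chunk; \texttt{datSetCluster} may still contain a few NA rows at this point):
```{r}
# Fraction of raw records removed during preprocessing (rough sketch)
1 - nrow(datSetCluster) / nrow(datCsv)
```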
```{r}
## Create Stan data
datFull <- list(N = nrow(datSetCluster),
p = ncol(datSetCluster) - 2,
died = as.numeric(datSetCluster$Died),
urban = as.numeric(datSetCluster$Urban),
year = as.numeric(datSetCluster$Year),
season = as.numeric(datSetCluster$Season),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education),
job = as.numeric(datSetCluster$Occupation),
method = as.numeric(datSetCluster$method))
```
In this phase we are testing different models, so it is worth working with only a random subsample of the data: the dataset is large, and computation on the whole of it would take a long time.
We therefore proceed as follows:
* we generate a vector of 50 random indices into our dataset;
* we test the models on this subsample, which is sufficient to avoid losing generality;
* we run the final model on the whole dataset.
```{r}
# Sample 50 Person_ID values with replacement, so duplicates may occur.
random_index <- sample(datSetCluster$Person_ID, size = 50, replace = TRUE)
# The IDs are used as row indices; IDs belonging to rows removed during
# preprocessing yield NA rows, which na.omit() then drops.
data_reduced <- datSetCluster[random_index, ]
data_reduced <- na.omit(data_reduced)
```
```{r}
## Create Stan data
dat_red <- list(N = nrow(data_reduced),
p = ncol(data_reduced) - 2,
died = as.numeric(data_reduced$Died),
urban = as.numeric(data_reduced$Urban),
year = as.numeric(data_reduced$Year),
season = as.numeric(data_reduced$Season),
sex = as.numeric(data_reduced$Sex),
age = as.numeric(data_reduced$Age),
edu = as.numeric(data_reduced$Education),
job = as.numeric(data_reduced$Occupation),
method = as.numeric(data_reduced$method))
```
### Full logistic regression model
Here we start by implementing the full logistic regression model.
```{r}
## FULL LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOne <- "./logistic_regression_model.stan"
stan_code_full <- readChar(fileNameOne, file.info(fileNameOne)$size)
cat(stan_code_full)
```
### Stan Code Running
The Stan models are run using five chains of 2000 iterations, a warmup of 800 iterations and a thinning factor of 10.
`thin` is a positive integer that specifies the period for saving samples; it defaults to 1 and is normally left at the default. In our case, though, the posterior draws take up a lot of memory even with the reduced dataset, and we require a large number of iterations to achieve a sufficient effective sample size, so in this phase we set it to 10. With these settings each chain keeps (2000 - 800) / 10 = 120 post-warmup draws, i.e. 600 draws over the five chains.
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stan_code_full,
data = dat_red,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanFull, pars = c('beta'))
```
### Variable selection
In this section we evaluate the most influential factors and their correlation, in order to select the most descriptive ones to be used to construct our second model (the reduced logistic regression model).\newline
First of all we process our data:
```{r}
# Collect the posterior draws of the coefficients into a matrix for the plots
beta_matrix <- zeros(length(extract(resStanFull)$beta[,1]), ncol(data_reduced) - 2)
for (i in 1:(ncol(data_reduced) - 2))   # one column per coefficient
  beta_matrix[,i] <- extract(resStanFull)$beta[,i]
beta_df <- as.data.frame(beta_matrix)
```
Now we generate scatter plots in order to evaluate the correlations between the parameters:
```{r}
# Generate some scatter plots in order to see the correlations between parameters
scatter_1 <- ggplot(beta_df, aes(x=V3, y=V7)) +
ggtitle("Correlation between location and education") +
xlab("Urban") + ylab("Education") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
scatter_2 <- ggplot(beta_df, aes(x=V3, y=V8)) +
ggtitle("Correlation between location and occuption") +
xlab("Urban") + ylab("Occupation") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
scatter_3 <- ggplot(beta_df, aes(x=V5, y=V6)) +
ggtitle("Correlation between gender and age") +
xlab("Gender") + ylab("Age") +
geom_point(size=1, shape=23) +
geom_smooth(method=lm, linetype="dashed", color="darkred", fill="blue")
ggplot2.multiplot(scatter_1,scatter_2,scatter_3, cols=1)
```
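The visual impression can be complemented numerically: the posterior correlation matrix of the coefficient draws summarises all pairwise relations at once (a small sketch; the column numbering of `beta_df` follows the coefficient ordering assumed above).
```{r}
# Posterior correlations between the coefficient draws (sketch)
round(cor(beta_df), 2)
```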
Now we overlay the histogram, density and mean value of each parameter; the most interesting plots are presented. Using the mean value is informative, since it helps identify which parameters influence the posterior the most.
Thus, looking at the weights of the parameters in the histograms, it is possible to identify with reasonable precision which parameters are the most informative.
```{r}
# Helper: overlay histogram, kernel density and posterior mean for one coefficient
plot_beta <- function(fit, idx, title) {
  draws <- extract(fit)$beta[, idx]
  qplot(draws, geom = 'blank',
        xlab = 'Value of weight', ylab = 'Density', main = title) +
    geom_histogram(aes(y = ..density..), col = I('red'), bins = 50) +
    geom_line(aes(y = ..density..), size = 1, col = I('blue'), stat = 'density') +
    geom_vline(aes(xintercept = mean(draws)), col = I('yellow'),
               linetype = "dashed", size = 1)
}
plot_1 <- plot_beta(resStanFull, 3, 'Urban')
plot_2 <- plot_beta(resStanFull, 5, 'Sex')
plot_3 <- plot_beta(resStanFull, 6, 'Age')
plot_4 <- plot_beta(resStanFull, 7, 'Education')
plot_5 <- plot_beta(resStanFull, 8, 'Occupation')
ggplot2.multiplot(plot_1,plot_2,plot_3,plot_4, plot_5, cols=3)
```
From the analysis above, and especially from the histograms, it is clear that the parameters that matter most in our analysis are: whether people come from urban or rural areas, their education, their occupation, and partially whether they are male or female. As a matter of fact, the means and maximum values of the coefficients related to those parameters have the largest magnitudes, which means those parameters carry the most weight in the regression function of the model.
Therefore, for the further analysis it makes sense to work with only these parameters, in order to obtain a more focused evaluation based on the most relevant factors.
### Full logistic regression model
For completeness we report here the full model once again.
```{r}
## FULL LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOne <- "./logistic_regression_model.stan"
stan_code_full <- readChar(fileNameOne, file.info(fileNameOne)$size)
cat(stan_code_full)
```
### Reduced logistic regression model
We can now implement the reduced logistic regression model using the selected parameters.
```{r}
## REDUCED LOGISTIC REGRESSION MODEL
## Load Stan Model
fileNameOneDef <- "./logistic_regression_model_def.stan"
stan_code_simple_def <- readChar(fileNameOneDef, file.info(fileNameOneDef)$size)
cat(stan_code_simple_def)
```
### Stan Code Running
Now we are going to run the model on the full dataset. \newline
First we define the Stan data to run the second (reduced) model.
```{r}
## Create Stan data
dat_def <- list(N = nrow(datSetCluster),
p = 4,
died = as.numeric(datSetCluster$Died),
sex = as.numeric(datSetCluster$Sex),
age = as.numeric(datSetCluster$Age),
edu = as.numeric(datSetCluster$Education))
```
We first run the full model. The settings are the same as before, except that now we use the full dataset and the default value for thin.
```{r echo=FALSE}
## FULL LOGISTIC REGRESSION MODEL
# Run Stan
resStanFull <- stan(model_code = stan_code_full,
data = datFull,
chains = 5,
iter = 2000,
warmup = 800,
thin = 1,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanFull, pars = c('beta'))
```
Now we run the reduced model.
```{r echo=FALSE}
## REDUCED LOGISTIC REGRESSION MODEL
# Run Stan
resStanRed <- stan(model_code = stan_code_simple_def,
data = dat_def,
chains = 5,
iter = 2000,
warmup = 800,
thin = 1,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(resStanRed, pars = c('beta'))
```
## Convergence and efficiency analysis
In this section we analyse the implemented models, both in terms of convergence (assessed using R-hat and the HMC-specific convergence diagnostics) and of efficiency (by computing the effective sample size). \newline
### R-hat
The R-hat convergence diagnostic compares between- and within-chain estimates for model parameters and other univariate quantities of interest. If the chains have not mixed well, R-hat is larger than 1. In practical terms, it is good practice to use at least four chains and to rely on the sample only if R-hat is less than 1.05. \newline
From the output of \texttt{print(fit)} displayed above, we can see that all the R-hat values are equal to one for both models, and therefore we have convergence.
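The same check can be made explicit programmatically (a small sketch using \texttt{rstan}'s \texttt{summary} method, whose summary matrix contains an \texttt{Rhat} column):
```{r}
# Largest R-hat over all quantities of both fits (sketch)
max(summary(resStanFull)$summary[, "Rhat"])
max(summary(resStanRed)$summary[, "Rhat"])
```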
### HMC
Here we compute convergence diagnostic specific to Hamiltonian Monte Carlo, and in particular divergences and tree depth.\newline
The following code computes the diagnostic for the full model:
```{r, fig.width=8, fig.height=5, warning=FALSE}
## Full model HMC diagnostic
check_hmc_diagnostics(resStanFull)
```
As we can see, none of the iterations ended with a divergence, nor did any saturate the maximum tree depth. \newline
Now we compute the diagnostic for the reduced model:
```{r, fig.width=8, fig.height=5, warning=FALSE}
## Reduced model HMC diagnostic
check_hmc_diagnostics(resStanRed)
```
Also for the reduced model none of the iterations ended with a divergence nor saturated the maximum tree depth.
### ESS
The effective sample size (ESS) measures the amount by which autocorrelation within the chains increases the uncertainty in the estimates. \newline
As with the R-hat values, we can directly observe the effective sample sizes of the chains in the output of \texttt{print(fit)} shown earlier. The values are all sufficiently high for both models.
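Again, this can be verified explicitly from the \texttt{n\_eff} column of the fit summary (sketch):
```{r}
# Smallest effective sample size over all quantities of both fits (sketch)
min(summary(resStanFull)$summary[, "n_eff"])
min(summary(resStanRed)$summary[, "n_eff"])
```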
## Model comparison
In order to carry out precise posterior predictive checking and model comparison, we decided to use the \texttt{rstanarm} built-in function \texttt{stan\_glm}.
This was done mainly for tractability: with this function it is easier to create different Stan models, add priors and generate new samples from the posterior. \newline
As can be seen from the R code that follows, the model comparison was developed by designing models with different distribution families.
```{r echo=FALSE, results = 'hide'}
## Different Stan models, testing different distribution families and priors
datSetCluster <- na.omit(datSetCluster)
# Define null and full model
model.null <- stan_glm(as.numeric(Died) ~ 1,
data = datSetCluster,
family = binomial(link = 'logit'))
model.full <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = normal(0,10),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.reduced = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method),
data = datSetCluster,
family = binomial(link = 'logit'),
prior = normal(0,10),
prior_intercept = NULL,
chains = 5,
iter = 2000,
warmup = 1000,
QR = TRUE,
adapt_delta = 0.99)
model.normal = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = gaussian,
prior = normal(0, 10),
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000,
adapt_delta = 0.99
)
model.poisson = stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) + as.numeric(Season),
data = datSetCluster,
family = poisson(link = "log"),
prior = normal(0,5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000,
adapt_delta = 0.99
)
```
```{r}
cat('\n Summary of null model \n')
summary(model.null)
cat('\n Summary of full model \n')
summary(model.full)
cat('\n Summary of reduced model \n')
summary(model.reduced)
cat('\n Summary of normal model \n')
summary(model.normal)
cat('\n Summary of Poisson model \n')
summary(model.poisson)
```
In this section the implemented models are compared using leave-one-out cross-validation.
```{r}
loo.null <- loo(model.null, cores = 4)
loo.full <- loo(model.full, cores = 4)
loo.reduced <- loo(model.reduced, cores = 4)
loo.normal <- loo(model.normal, cores = 4)
loo.poisson <- loo(model.poisson, cores = 4)
compar_models <- loo_compare(loo.null, loo.full, loo.reduced, loo.normal, loo.poisson)
compar_models
```
As the analysis above shows, the model that performs best is the one with all the parameters and the binomial distribution family.
This reflects our initial expectations: since the outcome being predicted is binary, using the binomial family for the model fitting gives the best results.
Moreover, at the beginning we thought that the model might suffer from overfitting when using all the parameters.
However, the analysis showed that using all the parameters gives the best fit.
## Sensitivity analysis
At this point it is useful to carry out a sensitivity analysis over the prior, using the full model with the binomial likelihood (i.e. the one that performed best), in order to understand which priors improve the posterior distribution without shifting it away from its mean.\newline
The prior choices that were made are described below:
\begin{itemize}
\item \textbf{Uniform Prior}
\item \textbf{Normal}: $N(0,5)$
\item \textbf{Student}: $Student_t(1,0,2.5)$
\item \textbf{Cauchy}: $Cauchy(0,4)$
\end{itemize}
```{r echo = FALSE, results = 'hide'}
# Different weakly informative prior choices with the full model
datSetCluster <- na.omit(datSetCluster)
model.full.uniform <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = NULL,
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.normal <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = normal(0,5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.student <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = student_t(1,0,2.5),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
model.full.cauchy <- stan_glm(
as.numeric(Died) ~ as.numeric(Urban) + as.numeric(Year) + as.numeric(Sex) + as.numeric(Age) + as.numeric(Education) + as.numeric(Occupation) + as.numeric(method) +
as.numeric(Season),
data = datSetCluster,
family = binomial(link = "logit"),
prior = cauchy(0,4),
prior_intercept = NULL,
QR = TRUE,
chains = 5,
iter = 2000,
warmup = 1000
)
```
```{r warning=FALSE}
pars_sel <- c('as.numeric(Urban)', 'as.numeric(Sex)', 'as.numeric(Age)',
              'as.numeric(Education)', 'as.numeric(Occupation)')
# Density plots
stan_dens(model.full.uniform, pars = pars_sel) + ggtitle('Plots for uniform prior')
stan_dens(model.full.normal, pars = pars_sel) + ggtitle('Plots for normal prior')
stan_dens(model.full.student, pars = pars_sel) + ggtitle('Plots for student prior')
stan_dens(model.full.cauchy, pars = pars_sel) + ggtitle('Plots for cauchy prior')
# Trace plots
stan_trace(model.full.uniform, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for uniform prior')
stan_trace(model.full.normal, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for normal prior')
stan_trace(model.full.student, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for student prior')
stan_trace(model.full.cauchy, pars = pars_sel) + scale_color_brewer(type = 'div') + ggtitle('Plots for cauchy prior')
```
Observing the distributions above, it is clear that even when the prior changes, the posterior distribution remains almost the same: there are only small differences among the analysed parameters across all the models.
This can be seen both in the plots of the distributions and in the trace plots of the chains.
Moreover, all the models behave well, achieving convergence in every case.
Below, the plots for R-hat and ESS are shown.
As said above, convergence is achieved: the R-hat values are all below 1.05, the usual threshold for convergence.
The ESS ratios are close to 1 as well.
```{r}
# Rhat histogram plots
rhat.uniform <- stan_rhat(model.full.uniform, bins = 50)
rhat.normal <- stan_rhat(model.full.normal, bins = 50)
rhat.student <- stan_rhat(model.full.student, bins = 50)
rhat.cauchy <- stan_rhat(model.full.cauchy, bins = 50)
uniform <- data.frame(rhat = rhat.uniform$data$stat)
normal <- data.frame(rhat = rhat.normal$data$stat)
student <- data.frame(rhat = rhat.student$data$stat)
cauchy <- data.frame(rhat = rhat.cauchy$data$stat)
uniform$distr <- 'uniform'
normal$distr <- 'normal'
student$distr <- 'student'
cauchy$distr <- 'cauchy'
distrLength <- rbind(uniform, normal, student, cauchy)
ggplot(distrLength, aes(rhat, fill = distr)) + geom_histogram(alpha = 0.5, aes(y = ..density..), bins = 50)+ scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2")
```
```{r}
# Ess histogram plots
ess.uniform <- stan_ess(model.full.uniform, bins = 50)
ess.normal <- stan_ess(model.full.normal, bins = 50)
ess.student <- stan_ess(model.full.student, bins = 50)
ess.cauchy <- stan_ess(model.full.cauchy, bins = 50)
uniform <- data.frame(ratio_ess = ess.uniform$data$stat)
normal <- data.frame(ratio_ess = ess.normal$data$stat)
student <- data.frame(ratio_ess = ess.student$data$stat)
cauchy <- data.frame(ratio_ess = ess.cauchy$data$stat)
uniform$distr <- 'uniform'
normal$distr <- 'normal'
student$distr <- 'student'
cauchy$distr <- 'cauchy'
distrLength <- rbind(uniform, normal, student, cauchy)
ggplot(distrLength, aes(ratio_ess, fill = distr)) + geom_histogram(alpha = 0.5, aes(y = ..density..), bins = 50)+ scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2")
```
```{r}
# Leave-one-out cross-validation over the models with different priors
loo.uniform <- loo(model.full.uniform, cores = 4)
loo.normal <- loo(model.full.normal, cores = 4)
loo.student <- loo(model.full.student, cores = 4)
loo.cauchy <- loo(model.full.cauchy, cores = 4)
compar_models_prior <- loo_compare(loo.uniform, loo.normal, loo.student, loo.cauchy)
compar_models_prior
```
Observing the results of the LOO-CV comparison, the models have almost the same performance; the elpd values differ only slightly.
That said, the model with the best performance is the one that uses a Cauchy distribution as prior. \newline
Therefore, that model was used to develop the predictive checking.
## Predictive checking
Observing the values of the relative elpd among the models tested in the previous section, the one that performs best is the one with a Cauchy prior.
```{r, eval=FALSE, echo=TRUE, include=TRUE}
y <- as.numeric(datSetCluster$Died)
y_tilde <- posterior_predict(model.full.cauchy, draws = 500)
color_scheme_set("brightblue")
ppc_dens_overlay(y, y_tilde[1:50, ])
```
Observing the plot resulting from the predictive check, the generated values reflect the observed ones.
We can therefore conclude that the model fits adequately.
## Age prediction
To conclude, we want to predict the most probable age of the attempters. A full regression model over age is implemented with a normal prior on the standard deviation, $\sigma \sim N(0,10)$; its convergence and efficiency are verified, and the posterior prediction is generated using a prior over the mean, $\mu_p \sim N(60, 10)$.
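The model itself lives in `stan_model_prior_all_params.stan` and is printed by the chunk below when knitting. For readers without the repository, a plausible sketch of such a model follows; the exact file may differ, and in particular the way the $\mu_p \sim N(60,10)$ prior enters the generated quantities is our assumption.
```stan
// Hypothetical sketch of stan_model_prior_all_params.stan (assumed, not the
// actual file): Gaussian regression of age on all remaining predictors.
data {
  int<lower=0> N;
  int<lower=0> p;                 // number of coefficients (here 10)
  vector[N] age;
  vector[N] died;  vector[N] hosp;  vector[N] year;    vector[N] sex;
  vector[N] job;   vector[N] urban; vector[N] edu;     vector[N] method;
  vector[N] season;
}
parameters {
  vector[p] beta;
  real<lower=0> sigma;
}
model {
  sigma ~ normal(0, 10);          // prior on the noise scale
  age ~ normal(beta[1] + beta[2] * died + beta[3] * hosp + beta[4] * year
               + beta[5] * sex + beta[6] * job + beta[7] * urban
               + beta[8] * edu + beta[9] * method + beta[10] * season, sigma);
}
generated quantities {
  // posterior prediction of the age; the N(60, 10) prior over the mean is
  // an assumption about how the original file constructs predict_age
  real mu_p = normal_rng(60, 10);
  real predict_age = normal_rng(mu_p, sigma);
}
```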
```{r warning=FALSE}
data_4_fit_complete <- list(N = nrow(datSet),
p = 10,
age = as.numeric(datSet$Age),
died = as.numeric(datSet$Died),
hosp = as.numeric(datSet$Hospitalised),
year = as.numeric(datSet$Year),
sex = as.numeric(datSet$Sex),
job = as.numeric(datSet$Occupation),
urban = as.numeric(datSet$Urban),
edu = as.numeric(datSet$Education),
method = as.numeric(datSet$method),
season = as.numeric(datSet$Season))
fileName <- "./stan_model_prior_all_params.stan"
stan_code_complete <- readChar(fileName, file.info(fileName)$size)
cat(stan_code_complete)
# Run Stan
fitStan_complete <- stan(model_code = stan_code_complete,
data = data_4_fit_complete,
chains = 5,
iter = 2000,
warmup = 800,
thin = 10,
refresh = 0,
seed = 12345,
control = list(adapt_delta = 0.95))
print(fitStan_complete, pars = c('beta', 'sigma','predict_age'))
```
```{r}
posterior <- as.array(fitStan_complete)
color_scheme_set("brightblue")
mcmc_dens_overlay(posterior, pars = c("beta[1]", "predict_age")) + ggtitle("Posterior draws and posterior predictive")
```
\clearpage
# Conclusions
The most influential factors related to SSAs are occupation, whether the individual lives in a rural or an urban environment, age and education level. However, we also noticed that the remaining factors are important for building a descriptive model of the phenomenon, since the full model proved consistently better than the reduced one. \newline
The age prediction confirmed the results of the analysis carried out by the authors of the paper we referred to in our project, giving an average value of 65 years of age.
## Problems encountered
Given the complexity of the structure of the problem, a lot of work was required to understand which model would suit our analysis best and how the data needed to be processed. Furthermore, it proved genuinely difficult to carry out a meaningful analysis and to interpret the results. This effort, however, gave us a great deal of experience with this kind of models and a much better intuition for the topics we had seen during the lectures and applied in the assignments.
## Potential improvements
With the data at our disposal, all relevant parts of the analysis were based on a standard multiple logistic regression model. In terms of Bayesian reasoning, the analysis could be improved by developing a more complex model, such as one with a hierarchical structure, and by a longer exploration of prior distributions (and of hyperpriors).