Cadavid_Nonis_Roghani_lab1_v2.Rmd

---
title: "An Exploratory Analysis of Cancer Incidence and Mortality to Identify High-Risk Communities and Improve Survival"
author: "Ramiro Cadavid, Pri Nonis, Payman Roghani"
date: "September 24, 2018"
output: pdf_document

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

##  Setup

```{r include=FALSE}
library(dplyr)
library(tidyr)
library(repr)
library(car)
library(fBasics)
library(xtable)
source("box_hist.R")
source("outliers.R")
options(repr.plot.width=5, repr.plot.height=5)
```


## Introduction

In this project our efforts are focused on the analysis of data included in the csv file provided, to primarily understand the potential relationship between different parameters and the incidences of cancer across counties in the US. The main objectives are:
>1. To understand factors that predict cancer mortality rate, with the ultimate aim of identifying communities for social interventions.
>2. To determine which interventions are likely to have the most impact.


### Cancer Data

```{r collapse=TRUE, comment=NA}
Cancer <- read.csv('cancer.csv', row.names = 1)
```

```{r collapse=TRUE, comment=NA, paged.print=TRUE}
summary(Cancer) #summary statistics
```

```{r ollapse = TRUE, comment=NA}
str(Cancer, strict.width = "wrap")
```

```{r collapse = TRUE, comment=NA}
colnames(Cancer)
cat("  \n")
print(paste0('Number of rows: ', nrow(Cancer)))
print(paste0('Number of columns: ', ncol(Cancer)))

```


The cancer.csv file contains 29 variables (30 columns, including the first one that only contains the row numbers) and 3,047 observations. Each observation (i.e. row) includes data for a county across the US. The variables are mostly numbers and integers, except for 2 that are factors (`binnedInc` and `Geography`). Below, we explain the variables in detail and provide our assessment of the quality of the data.  


#### Variables

* Cancer data:
    + `avgAnnCount`: The average number of new cancer cases per year per county for years 2009-2013
    + `popEst2015`: Estimated population by county 2015

* Economic status:
    + `medIncome`: Median income per county
    + `povertyPercent`: Percent of population below poverty line
    + `binnedInc`: ???

* Population age and gender:
    + `MedianAge`: Median age per county
    + `MedianAgeMale`: Median age among males per county
    + `MedianAgeFemale`: Median age among females per county  

* Location:
    + `Geography`: County, State names

* Marital status:
    + `PercentMarried`: Percentage of married population
    + `PctMarriedHouseholds`: Percentage of married households per county

* Education:
    + `PctNoHS18_24`: Percentage of 18-24 year old population with no high school education
    + `PctHS18_24`: Percentage of 18-24 year old population with high school education
    + `PctSomeCol18_24`: Percentage of 18-24 year old population with some college education
    + `PctBachDeg18_24`: Percentage of 18-24 year old population with bachelor's degree 
    + `PctHS25_Over`: Percentage of population above 25 years old with high school education
    + `PctBachDeg25_Over`: Percentage of population above 25 years old with bachelor's degree 

* Household size:
    + `AvgHouseholdSize`: Average household size per county

* Employment status:
    + `PctEmployed16_Over`: Percentage of population above 16 years old who have jobs
    + `PctUnemployed16_Over`: Percentage of population above 16 years old with no jobs

* Health insurance coverage:
    + `PctPrivateCoverage`: Percentage of the population with private insurance coverage
    + `PctEmpPrivCoverage`: percentage of the population with employer-sponsored insurance coverage
    + `PctPublicCoverage`: Percentage of the population with public insurance coverage

* Race:
    + `PctWhite`: Percentage of white population by county
    + `PctBlack`: Percentage of African-American population by county
    + `PctAsian`: Percentage of Asian population by county
    + `PctOtherRace`: Percentage of other races by county

* Birth and death rates: 
    + `BirthRate`: Birth rate per county
    + `deathRate`: Death rate per county       


#### Evaluation of Dataset and Variables

Based on the outputs from diagnostic and summary statistics functions, as well as further univariate analysis, using relevant charts, below we describe our evaluation of the dataset and its variables. Since definition of most variables was not provided to us, our first step was to ensure understanding of what exactly such variables represent. We also evaluated the data to identify potentially erroneous values, extreme outliers and variables that might require transformation.


* **Data time frame:**: While `avgAnnCount` represents statistics for 2009-2013, the population by county is for 2015 and other variables do not have date stamps. Ideally all variables should have been from the same time period.

* **`avgAnnCount` definition:**: There is no clear definition for incidence rate per county for the `avgAnnCount` variable. Since the sum of all values is 1,847,514 and that based on [Cancer.gov](https://www.cancer.gov/)) data the average number of cases for 2009-2013 is 1,617,144, we assume this variable represents the actual count of new cases. Therefore, in our analysis we created a new variable called `incidenceRate` to represent the incidence rate of cancer per 100,000 people per county to be able to compare the spread of new cancer cases in different geographical regions regarldess of the actual population of such regions.

```{r collapse = TRUE, comment=NA}
#caclulating the total for avgAnnCount to compare with offical reports by Cancer.gov
sum(Cancer$avgAnnCount)
```

Official Cancer Statistics, 2009-2013

|Year | New Cases |  Deaths|
|-----|-----|-------|
|2009 |	1,660,290 | 562,340 | 
|2010 |	1,529,560	| 569,490 |
|2011 |	1,596,670	| 571,950 |
2012 | 1,638,910	| 577,190 |
2013 | 1,660,290  |	580,350 |

Source: [Cancer.gov](https://www.cancer.gov/)

```{r collapse = TRUE, comment=NA}
#calculating the mean of the number of new cancer cases for years 2009-2013
#based on Cancer.gov data, in order to cofirm our assumption regarding avgAnnCount  
incidence_cancer_gv <- c(1660290, 1529560, 1596670, 1638910, 1660290)
mean(incidence_cancer_gv)

```


* **Anomaly in `avgAnnCount`:** Through our assessment, we noticed that the number of new cancer cases (`avgAnnCount`) for 6 counties were greater than those counties' populations (`popEst2015`). Looking at the 6 observations, we realized that the value assigned to `popEst2015` for all these 6 counties is exactly the same number (1962.667684). In fact there are a total of 206 counties that have exactly the same average number of new cases, which is probably an error in the dataset. We decided to replace all of them with NA in our analysis.

* **`Geography`:** We checked this variable to identify potential duplicates. Since the number of unique values in this column (3,047) is equal to the total number of observations, there can not be any duplicates in this column.

```{r collapse = TRUE, comment=NA}
#checking for potential dubplicates in this variable
length(unique(Cancer[["Geography"]]))
```


* **`binnedInc`:** This variable has 10 levels that seem arbitrary and have different bin sizes. It is not clear why the income bins have been defined this way. As a result, we decided to ignore it in our analysis.


* **Anomaly in `MedianAge`:** The maximum `MedianAge` shows a value of 624, which is clearly a wrong number. We actually identified a total of 30 values in this column that are above 100; therefore, we will replace such values with NA in our analysis.

```{r collapse = TRUE, comment=NA}
#checking the number of erroneous values
age_error = subset(Cancer, MedianAge > 100) 
nrow(age_error)
```


* **Anomaly in `AvgHouseholdSize`:** The minimum for `AvgHouseholdSize` is 0.0221, which does not make sense, since we do not expect a household size below 1. There are 61 values in this column that are below 1, which we will replace with NA in our analysis.  

```{r collapse = TRUE, comment=NA}
#checking the number of erroneous values
household_error = subset(Cancer, AvgHouseholdSize < 1) 
nrow(household_error)
```


* **`PctSomeCol18_24`:** 75% of values within this variable are NAs (2285 out 3047). Therefore, we decided to ignore this variable in our analysis.


* **`BirthRate`:** It is not clear what exactly this represents. Often, the birth rate is defined as childbirths per 1,000 people per year, but applying that to this variable would not give us the right number. For example in Los Angeles County with the population of 10,170,292, there were 124,641 live births in 2015 based on official reports, which translates into a birth rate of 12.25 (BR = (b / p) X 1,000). However, the birth rate in our data shows a value of 4.7 for this county, which is probably the ratio of women aged 15-50 years old who gave birth in 2015 as reported by [TownCharts](http://www.towncharts.com/California/Demographics/Los-Angeles-County-CA-Demographics-data.html)). As a result, we decided to ignore this variable in our analysis.

```{r collapse = TRUE, comment=NA}
#checking the BirthRate value for Los Angeles County
Cancer[1000,'BirthRate']
```


```{r collapse = TRUE, comment=NA}
#Calculating LA County birth rate based on official figures. Formula: BR = (b ÷ p) X 1,000
124641/10170292*1000
```


* **`deathRate`:** Based on our assessment, we believe this variable represents the number of deaths due to cancer per 100,000 population per county. For instance, we looked at the figure for Kings County, NY (173.6) and the number in our data is closer to the officially reported cancer death rate (140.3), as opposed to overall death rate (603.1). We also calculated the actual number of deaths per county (`deathRate` * `popEst2015` / 100000) and the total for these values, which is eaual to 525,347. This number is pretty close to the figure reported by [Cancer.gov](https://www.cancer.gov/) (589,430), further confirming our assumption regarding `deathRate`.


```{r collapse = TRUE, comment=NA}
#checking the deathRate for Kings County, NY
Cancer[388, 'deathRate']
```

    Kings County, NY statistics:
        2015 population: 2,673,000
        2015 death rate (per 100,000 population): 603.1
        2015 Cancer death rate (per 100,000 population): 140.3

Sources: [DATA USA](https://datausa.io/), 
         [NY State Dpt of Health](https://www.health.ny.gov/)


```{r collapse = TRUE, comment=NA}
#comparing total death count in our dataset with Cancer.gov stats 
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
sum(Cancer$death_count)
```


* **`PctEmpPrivCoverage`:** We assume that the values in this variable represent a subset of values in `PctPrivateCoverage`, since the sum of these two variables in some rows is above 100.

* **Overlap between `PctPrivateCoverage` and `PctPublicCoverage`:** We assume that there is an overlap between people that have public health insurance and those with private health insurance, since the sum of `PctPrivateCoverage` and `PctPublicCoverage` in some rows is above 100. In fact, this is not uncommon among some senior citizenz that have both Medicare and a supplementary private health plan (aka. Medigap).

```{r collapse = TRUE, comment=NA}
#adding up health insurance coverage variables to check for overlaps
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
Cancer$Pct_PersonalIsure <- Cancer$PctPrivateCoverage + Cancer$PctEmpPrivCoverage
print('Cancer$Pct_insured')
summary(Cancer$Pct_insured)
print('Cancer$Pct_PersonalIsure')
summary(Cancer$Pct_PersonalIsure)
```


#### Data transformations

Based on the data evaluation mentiond before and additional analysis, we are transforming some of the valuables that have issues, as explained below. 

```{r  collapse=TRUE, comment=NA}
#creating separarte County and State columns to enable state-wide analysis
Cancer <- Cancer %>% separate(Geography, c("County", "State"), sep = ", ", remove = FALSE) 
#replacing erroneous values with NA
Cancer$MedianAge[Cancer$MedianAge > 100] <- NA
Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize < 1] <- NA
# Define new binned income with equal sized bins
bins <- seq(20000, 130000, by = 10000)
Cancer$binnedInc2 <- cut(Cancer$medIncome, breaks = bins)
# Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
# Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
#creating a new variable to represent actual number of deaths due to cancer per county
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
#creating a new variable to represent the percentage of population with health insurance
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
```


# Univariate Analysis of Key Variables

Even though the presentation of this section takes a linear form, the actual analysis of key variables was an iterative process. The key variables were chosen based on:
* Our initial hypotheses regarding variables were potentially related to `deathRate`.
* The possibility that a variable/factor could be changed through interventions implemented by government health agencies to improve cancer prevention and survival.
* Additional analysis that we performed to identify variables that actually had a correlation with the dependent variable.

After selecting the key variables, our approach was to focus on assessing the quality of the data (as partly explained in the Introduction) and detecting features, through univariate analysis, that are important to include when modelling the relationships of interest, such as particular features in the distributions, unusual concentrations of observations around certain values, the presence of outliers and extreme outliers, among others.

### Death rate

Death rate's distribution is symmetric and bell-shaped, with a small amount of outliers at both sides of the mean (2.1% of outliers, with 0.03% of extreme outliers). However, these outliers are still within a reasonable range and do not seem to be errors in the data. Furthermore, the observation corresponding to the only extreme outlier does not look atypical based on the values of the other variables.

Finally, using both summary metrics and visualizations, we did not find any unusual concentration of observations around specific values.


```{r  collapse=TRUE, comment=NA}
summary(Cancer$deathRate)
```
```{r}
boxHist(Cancer$deathRate, "Death rate (nummber of deaths per 100k people)")
``` 

```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'deathRate')
```

```{r  collapse=TRUE, comment=NA}
Cancer[Cancer$deathRate > 300, ]
```

### Incidence (DEATILS MIGHT BE HIDDEN TO SAVE SPACE)

Looking at the frequency of unique values in 'AvgAnnCount', we found that 206 observations contain the value 1962.667684. This is very likely an error because the values in this variable should all be integers, and in some cases this value is higher than the county population.

```{r incidence_freq, collapse=TRUE, comment=NA}
incidence_freq <- data.frame(table(Cancer$avgAnnCount))
incidence_freq[incidence_freq$Freq > 20, ]
```

```{r  collapse=TRUE, comment=NA}
table(Cancer$avgAnnCount > Cancer$popEst2015)
```

Furthermore, these values are causing the incidence rate (that we will build to be able to compare death with incidence) to have extremely large values.

Incidence rate contains 188 extremely large values (higher than 1500 cases per 100,000 people). As can be seen below, all of these values are caused by the error in AvgAnnCount.

```{r  collapse=TRUE, comment=NA}
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
table(Cancer$incidenceRate > 1500)
table(Cancer$incidenceRate[Cancer$avgAnnCount != 1962.667684] > 1500)
```

Therefore, we decided to remove these "1962.667684" values and replace them with NA.

```{r  collapse=TRUE, comment=NA}
Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
#creating new variable to represent cancer incidence per 100K people per county
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000 
```

```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'avgAnnCount')
```


### Incidence rate

The distribution of the incidence rate is unimodal and positively skewed, with 46 outliers and 1 extreme outlier. Since these values represent only 1.5% of observations and there is no furhter evidence that they are errors, we will keep them, but this should be taken into account when modelling the relationship between incidence and death rates.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$incidenceRate)
```

```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$incidenceRate, "Incidence rate (new cases per 100k people)")
```

```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'incidenceRate')
```

### Median income

There are two income variables available: binned income and median income. From these two, we chose median income as our key variable because it is more granular than binned income and, second, because the width of the bins in binned income seems to have been defined such that there are the same number of observations in each bin, which is not useful to observe its distribution, and the cutoffs chosen make the charts hard to read.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$binnedInc)
```

Below, we can see that the median income is inded a good candidate, since it doesn't vary as much as income typically does (in this case, the difference between the minimum and maximum values is less than one order of magnitude), representing better the "average" member of each county. However, it's distribution is positively sekewed, with 64 counties that have median income greater than 80,000 USD.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$medIncome)
```

```{r  collapse=TRUE, comment=NA}
sum(Cancer$medIncome > 80000)
```

```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$medIncome, "Median income")
```

Including the 64 observations above that contribute to the positive skewness of this variable, there are still 122 outliers (around 4% of the total observations) that need to be taken into account when building the statistical model that captures the relationship between this variable and the death rate.

```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "medIncome")
```

Given the rather large number of outliers in this variable, once might consider log transformation. However, we decided to follow the rule provided by Fox (2011), where log transformation is only likely to make a difference if its values "cover two or more orders of magnitude" (Fox, p. 128).


### Education

To measure education, we have six possible candidates: 'PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24', 'PctHS25_Over' and 'PctBachDeg25_Over', that can be divided into two groups: 18-24 and '25 and above' years old. Our initial hypothesis was that the second group should have a stronger correlation with death rate. We validated this hypothesis with the correlations table shown below, that shows the only variable that has correlation with deathRate is `PctBachDeg18_24` (although this correlation is not strong, -0.31). Instead, as expected, the two '25 and above' education variables have a much greater correlation with `deathRate`.

Therefore, we will focus on these two variables for further analyses on education.

```{r  collapse=TRUE, comment=NA}
cor(Cancer[, names(Cancer) %in% 
           c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24', 
             'PctHS25_Over', 'PctBachDeg25_Over', 'deathRate')], use = 'complete.obs')[7, ]
```

We also validated that education variables within each group are mutually exclusive, by making sure that they add up to 100% for all observations that have complete data. The total values range between 99.9 and 100.1, where the small variations around 100 are most probably due to rounding.

We can only test this with the 18-24 group since the 25_over group is missing two variables that capture 'no high school' and 'some college'. However, it is reasonable to assume that the same definition is applied to our group of interest (25_over).

```{r  collapse=TRUE, comment=NA}
educ.18.24 <- c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24')
educ.df <- subset(Cancer, select = educ.18.24)
educ.complete <- complete.cases(educ.df)
sum.pct.freq <- data.frame(table(rowSums(educ.df[educ.complete, ], na.rm = TRUE)))
names(sum.pct.freq) <- c("Sum", "Frequency")
sum.pct.freq
```


#### PctHS25_over

Values in `PctHS25` are within a reasonable range (7 to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. Also, the disribution of this variable is unimodal and negatively skewed. However, it only contains 31 outliers (1% of observations) and there are no extreme outliers. Furthermore, there is no indication that these outliers are errors, so we decided to keep them.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$PctHS25_Over)
```


```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$PctHS25_Over, "Percentage age 25 or older with high school only")
```


```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctHS25_Over")
```


#### PctBachDeg25_Over

Values in `PctHS25_Over` are within a reasonable range (7% to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. The disribution of this variable is unimodal and positively skewed. It contains 82 outliers (2.7% of observations) all of which at are at the right side of the mean. Of these 82 outliers, only 5 are extreme outliers, that we will keep in the dataset, since there are no indications that they are errors.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$PctBachDeg25_Over)
```


```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$PctBachDeg25_Over, "Percentage age 25 or older with bachelors degree only")
```


Extreme outliers
```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctBachDeg25_Over")
```


### Poverty percent

The distribution of `povertyPercent` is unimodal and positively skewed. This is reflected by the fact that all outliers are at the right of the mean. Taking a deeper dive into the outliers, we found that only 3 are extreme while 66 are mild. For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.

However, when modeling the relationship of interest, we should take into account that the distribution of this variable is not normal and it might need transformation if the model used requires it.

```{r  collapse=TRUE, comment=NA}
summary(Cancer$povertyPercent)
```

```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$povertyPercent, "Percentage age 25 or older with up to bachelors degree")
```

Extreme outliers
```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "povertyPercent")
```


### Percentage employed (16 or over)

The distribution of `PctEmployed16_Over` is unimodal and negatively skewed. There are no extreme outliers and 20 mild outliers (0.7% of observations). For this reason, and because we did not find any indication that the outliers or other values were errors, we will keep all the values within this variable.


```{r  collapse=TRUE, comment=NA}
summary(Cancer$PctEmployed16_Over)
```

```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$PctEmployed16_Over, "Percentage age 25 or older with up to bachelors degree")
```

Extreme outliers
```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctEmployed16_Over")
```

### Percentage with public coverage

The distribution of `PctPublicCoverage` is unimodal and symmetric, with no extreme outliers and only 18 mild outliers (0.6% of observations). For this reason, and because we did not find any indication that the outliers or other values were errors, we will keep all data from this variable. There are also no other particular features from this variables that grant further warnings in modelling the relationship with `deathRate`.


```{r  collapse=TRUE, comment=NA}
summary(Cancer$PctPublicCoverage)
```

```{r  collapse=TRUE, comment=NA}
boxHist(Cancer$PctPublicCoverage, "Percentage age 25 or older with up to bachelors degree")
```

Extreme outliers
```{r  collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctPublicCoverage")
```


## Analysis of Key Relationships

### Education

As explained above, guided by our hypothesis that the education of the '25 and over' years old group should have a much stronger relationship with deathRate than the '18-24' years old group, which was supported by the correlations between these variables, we will focus on the former group.

#### PctHS25_over

A correlation of $0.4$ between `PctHS25_over` and `deathRate` indicates that there is indeed a relationship between these variables, which is further indicated by plotting them together in a scatterplot, that shows that higher values of percentage of population with only high school tend to be associated to higher death rates (this is also reflected in the regression line added to the scatterplot).

Such relationship is not unexpected, since it indicates that a higher concentration of people with low education levels may have poorer health habits and limited access to medical services. However, both of these variables could be affeccted by `MedianAge` in the same direction: older populations might have lower levels of higher education and higher rates of death.

```{r collapse=TRUE, comment=NA}
cor(Cancer$deathRate, Cancer$PctHS25_Over)
```

```{r  collapse=TRUE, comment=NA}
plot(Cancer$PctHS25_Over, Cancer$deathRate, main = "HS (>24)")
abline(lm(Cancer$deathRate ~ Cancer$PctHS25_Over), lty = 'dashed', lwd = 2, col = 'red')
```


#### PctBachDeg25_0ver

A correlation of $-0.48$ indicates that there is relationship between `PctBachDeg25_over` and `deathRate`, which is further supported by plotting these variables in a scatterplot, where it can be seen that a higher percentage of people with bachelor's degree is associated with lower levels of death rates. This relationship is  also supported by the regression line included in the scatterplot.

This is also not an unusual relationship, since higher levels of education might be linked to better health habits and access to health services. However, and following the same reasoning than `PctHS25_over`, the relationship between these two variables may be confounded by `MedianAge`, although it is not clear in which direction this effect might go. Therefore, it will also be necessary to explore the effect of `MedianAge` in the following section.

```{r  collapse=TRUE, comment=NA}
cor(Cancer$deathRate, Cancer$PctBachDeg25_Over)
```

```{r  collapse=TRUE, comment=NA}
plot(Cancer$PctBachDeg25_Over, Cancer$deathRate, main = "Bachelor (>24)")
abline(lm(Cancer$deathRate ~ Cancer$PctBachDeg25_Over), lty = 'dashed', lwd = 2, col = 'red')
```


## Analysis of Secondary Effects

Throughout the analyses above, we began to identify that some of the relationships found between `deathRate` and other variables may not only be capturing the direct relationship between these variables, but also those of additional variable(s) that may be impacting both. To further assess this systematically, the following network visualization shows the variables that have a correlation higher than 0.4, where each node represents a different variable and each vertex indicates the strength of the relationship between the variables connected.


#![secondary_analysis]
#(secondary_analysis.png)

### Age and family/householdd

`PercentMarried` has a (weak) relation and `AvgHouseholdSize` has a moderate relation with `MedianAge`. Based on these results, we explored this relationship further.

```{r  collapse=TRUE, comment=NA}
cor(subset(Cancer, 
           select = c("MedianAge", "AvgHouseholdSize", "PercentMarried")), 
    use = "pairwise.complete.obs")[1, ]
```

Both the scatterplots below and the regression lines imposed on them provide further support that there is indeed a relationship between these two variabes and `MedianAge`, indicating that `MedianAge` may confound the relationship between these two and `deathRate`. Therefore, this should be taken into account when modelling the relationships of interest, in order to isolate the effect of the family variables on the death rate.

```{r  collapse=TRUE, comment=NA}
plot(Cancer$MedianAge, Cancer$PercentMarried, main = "Age vs PercentMarried")
abline(lm(PercentMarried ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```

```{r  collapse=TRUE, comment=NA}
plot(Cancer$MedianAge, Cancer$AvgHouseholdSize, main = "Age vs Average household size")
abline(lm(AvgHouseholdSize ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```


### Black population and employment

Correlation analysis shows that there is a relationship between the percentage of black population and employment, which is further confirmed both by a visual inspection of the scatterplot and the linear regression line charted in this plot. Since employment es related to `deathRate`, its correlation with `PctBlack` may indicate that this variable may be confounding the relationship of interest and thus further modeling needs to take this into account, to isolate the effect of unemployment on death rate.

```{r  collapse=TRUE, comment=NA}
cor(subset(Cancer, select = c("PctBlack", "PctUnemployed16_Over")),
    use = "pairwise.complete.obs")[, 1]
```

```{r  collapse=TRUE, comment=NA}
plot(Cancer$PctBlack, Cancer$PctUnemployed16_Over, main = "Age vs PercentMarried")
abline(lm(PctUnemployed16_Over ~ PctBlack, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```


### State

A boxplot containing different location measures of `deathRate` by `State` shows that these values vary measures vary significantly across state. Since `State` may be capturing several state-level characteristics that may in turn affect other variables that have a relation with `deathRate`, it is recommended to include state-level effects when modeling the relation of interest, to control for confounding these state-level factors.

```{r  collapse=TRUE, comment=NA}
boxplot(Cancer$deathRate ~ Cancer$State)
```