-
Notifications
You must be signed in to change notification settings - Fork 0
/
Cadavid_Nonis_Roghani_lab1_v2.Rmd
590 lines (393 loc) · 29.6 KB
/
Cadavid_Nonis_Roghani_lab1_v2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
---
title: "An Exploratory Analysis of Cancer Incidence and Mortality to Identify High-Risk Communities and Improve Survival"
author: "Ramiro Cadavid, Pri Nonis, Payman Roghani"
date: "September 24, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Setup
```{r include=FALSE}
library(dplyr)
library(tidyr)
library(repr)
library(car)
library(fBasics)
library(xtable)
source("box_hist.R")
source("outliers.R")
options(repr.plot.width=5, repr.plot.height=5)
```
## Introduction
In this project our efforts are focused on the analysis of data included in the csv file provided, to primarily understand the potential relationship between different parameters and the incidences of cancer across counties in the US. The main objectives are:
>1. To understand factors that predict cancer mortality rate, with the ultimate aim of identifying communities for social interventions.
>2. To determine which interventions are likely to have the most impact.
### Cancer Data
```{r collapse=TRUE, comment=NA}
Cancer <- read.csv('cancer.csv', row.names = 1)
```
```{r collapse=TRUE, comment=NA, paged.print=TRUE}
summary(Cancer) #summary statistics
```
```{r ollapse = TRUE, comment=NA}
str(Cancer, strict.width = "wrap")
```
```{r collapse = TRUE, comment=NA}
colnames(Cancer)
cat(" \n")
print(paste0('Number of rows: ', nrow(Cancer)))
print(paste0('Number of columns: ', ncol(Cancer)))
```
The cancer.csv file contains 29 variables (30 columns, including the first one that only contains the row numbers) and 3,047 observations. Each observation (i.e. row) includes data for a county across the US. The variables are mostly numbers and integers, except for 2 that are factors (`binnedInc` and `Geography`). Below, we explain the variables in detail and provide our assessment of the quality of the data.
#### Variables
* Cancer data:
+ `avgAnnCount`: The average number of new cancer cases per year per county for years 2009-2013
+ `popEst2015`: Estimated population by county 2015
* Economic status:
+ `medIncome`: Median income per county
+ `povertyPercent`: Percent of population below poverty line
+ `binnedInc`: ???
* Population age and gender:
+ `MedianAge`: Median age per county
+ `MedianAgeMale`: Median age among males per county
+ `MedianAgeFemale`: Median age among females per county
* Location:
+ `Geography`: County, State names
* Marital status:
+ `PercentMarried`: Percentage of married population
+ `PctMarriedHouseholds`: Percentage of married households per county
* Education:
+ `PctNoHS18_24`: Percentage of 18-24 year old population with no high school education
+ `PctHS18_24`: Percentage of 18-24 year old population with high school education
+ `PctSomeCol18_24`: Percentage of 18-24 year old population with some college education
+ `PctBachDeg18_24`: Percentage of 18-24 year old population with bachelor's degree
+ `PctHS25_Over`: Percentage of population above 25 years old with high school education
+ `PctBachDeg25_Over`: Percentage of population above 25 years old with bachelor's degree
* Household size:
+ `AvgHouseholdSize`: Average household size per county
* Employment status:
+ `PctEmployed16_Over`: Percentage of population above 16 years old who have jobs
+ `PctUnemployed16_Over`: Percentage of population above 16 years old with no jobs
* Health insurance coverage:
+ `PctPrivateCoverage`: Percentage of the population with private insurance coverage
+ `PctEmpPrivCoverage`: percentage of the population with employer-sponsored insurance coverage
+ `PctPublicCoverage`: Percentage of the population with public insurance coverage
* Race:
+ `PctWhite`: Percentage of white population by county
+ `PctBlack`: Percentage of African-American population by county
+ `PctAsian`: Percentage of Asian population by county
+ `PctOtherRace`: Percentage of other races by county
* Birth and death rates:
+ `BirthRate`: Birth rate per county
+ `deathRate`: Death rate per county
#### Evaluation of Dataset and Variables
Based on the outputs from diagnostic and summary statistics functions, as well as further univariate analysis, using relevant charts, below we describe our evaluation of the dataset and its variables. Since definition of most variables was not provided to us, our first step was to ensure understanding of what exactly such variables represent. We also evaluated the data to identify potentially erroneous values, extreme outliers and variables that might require transformation.
* **Data time frame:**: While `avgAnnCount` represents statistics for 2009-2013, the population by county is for 2015 and other variables do not have date stamps. Ideally all variables should have been from the same time period.
* **`avgAnnCount` definition:**: There is no clear definition for incidence rate per county for the `avgAnnCount` variable. Since the sum of all values is 1,847,514 and that based on [Cancer.gov](https://www.cancer.gov/)) data the average number of cases for 2009-2013 is 1,617,144, we assume this variable represents the actual count of new cases. Therefore, in our analysis we created a new variable called `incidenceRate` to represent the incidence rate of cancer per 100,000 people per county to be able to compare the spread of new cancer cases in different geographical regions regarldess of the actual population of such regions.
```{r collapse = TRUE, comment=NA}
#caclulating the total for avgAnnCount to compare with offical reports by Cancer.gov
sum(Cancer$avgAnnCount)
```
Official Cancer Statistics, 2009-2013
|Year | New Cases | Deaths|
|-----|-----|-------|
|2009 | 1,660,290 | 562,340 |
|2010 | 1,529,560 | 569,490 |
|2011 | 1,596,670 | 571,950 |
2012 | 1,638,910 | 577,190 |
2013 | 1,660,290 | 580,350 |
Source: [Cancer.gov](https://www.cancer.gov/)
```{r collapse = TRUE, comment=NA}
#calculating the mean of the number of new cancer cases for years 2009-2013
#based on Cancer.gov data, in order to cofirm our assumption regarding avgAnnCount
incidence_cancer_gv <- c(1660290, 1529560, 1596670, 1638910, 1660290)
mean(incidence_cancer_gv)
```
* **Anomaly in `avgAnnCount`:** Through our assessment, we noticed that the number of new cancer cases (`avgAnnCount`) for 6 counties were greater than those counties' populations (`popEst2015`). Looking at the 6 observations, we realized that the value assigned to `popEst2015` for all these 6 counties is exactly the same number (1962.667684). In fact there are a total of 206 counties that have exactly the same average number of new cases, which is probably an error in the dataset. We decided to replace all of them with NA in our analysis.
* **`Geography`:** We checked this variable to identify potential duplicates. Since the number of unique values in this column (3,047) is equal to the total number of observations, there can not be any duplicates in this column.
```{r collapse = TRUE, comment=NA}
#checking for potential dubplicates in this variable
length(unique(Cancer[["Geography"]]))
```
* **`binnedInc`:** This variable has 10 levels that seem arbitrary and have different bin sizes. It is not clear why the income bins have been defined this way. As a result, we decided to ignore it in our analysis.
* **Anomaly in `MedianAge`:** The maximum `MedianAge` shows a value of 624, which is clearly a wrong number. We actually identified a total of 30 values in this column that are above 100; therefore, we will replace such values with NA in our analysis.
```{r collapse = TRUE, comment=NA}
#checking the number of erroneous values
age_error = subset(Cancer, MedianAge > 100)
nrow(age_error)
```
* **Anomaly in `AvgHouseholdSize`:** The minimum for `AvgHouseholdSize` is 0.0221, which does not make sense, since we do not expect a household size below 1. There are 61 values in this column that are below 1, which we will replace with NA in our analysis.
```{r collapse = TRUE, comment=NA}
#checking the number of erroneous values
household_error = subset(Cancer, AvgHouseholdSize < 1)
nrow(household_error)
```
* **`PctSomeCol18_24`:** 75% of values within this variable are NAs (2285 out 3047). Therefore, we decided to ignore this variable in our analysis.
* **`BirthRate`:** It is not clear what exactly this represents. Often, the birth rate is defined as childbirths per 1,000 people per year, but applying that to this variable would not give us the right number. For example in Los Angeles County with the population of 10,170,292, there were 124,641 live births in 2015 based on official reports, which translates into a birth rate of 12.25 (BR = (b / p) X 1,000). However, the birth rate in our data shows a value of 4.7 for this county, which is probably the ratio of women aged 15-50 years old who gave birth in 2015 as reported by [TownCharts](http://www.towncharts.com/California/Demographics/Los-Angeles-County-CA-Demographics-data.html)). As a result, we decided to ignore this variable in our analysis.
```{r collapse = TRUE, comment=NA}
#checking the BirthRate value for Los Angeles County
Cancer[1000,'BirthRate']
```
```{r collapse = TRUE, comment=NA}
#Calculating LA County birth rate based on official figures. Formula: BR = (b ÷ p) X 1,000
124641/10170292*1000
```
* **`deathRate`:** Based on our assessment, we believe this variable represents the number of deaths due to cancer per 100,000 population per county. For instance, we looked at the figure for Kings County, NY (173.6) and the number in our data is closer to the officially reported cancer death rate (140.3), as opposed to overall death rate (603.1). We also calculated the actual number of deaths per county (`deathRate` * `popEst2015` / 100000) and the total for these values, which is eaual to 525,347. This number is pretty close to the figure reported by [Cancer.gov](https://www.cancer.gov/) (589,430), further confirming our assumption regarding `deathRate`.
```{r collapse = TRUE, comment=NA}
#checking the deathRate for Kings County, NY
Cancer[388, 'deathRate']
```
Kings County, NY statistics:
2015 population: 2,673,000
2015 death rate (per 100,000 population): 603.1
2015 Cancer death rate (per 100,000 population): 140.3
Sources: [DATA USA](https://datausa.io/),
[NY State Dpt of Health](https://www.health.ny.gov/)
```{r collapse = TRUE, comment=NA}
#comparing total death count in our dataset with Cancer.gov stats
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
sum(Cancer$death_count)
```
* **`PctEmpPrivCoverage`:** We assume that the values in this variable represent a subset of values in `PctPrivateCoverage`, since the sum of these two variables in some rows is above 100.
* **Overlap between `PctPrivateCoverage` and `PctPublicCoverage`:** We assume that there is an overlap between people that have public health insurance and those with private health insurance, since the sum of `PctPrivateCoverage` and `PctPublicCoverage` in some rows is above 100. In fact, this is not uncommon among some senior citizenz that have both Medicare and a supplementary private health plan (aka. Medigap).
```{r collapse = TRUE, comment=NA}
#adding up health insurance coverage variables to check for overlaps
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
Cancer$Pct_PersonalIsure <- Cancer$PctPrivateCoverage + Cancer$PctEmpPrivCoverage
print('Cancer$Pct_insured')
summary(Cancer$Pct_insured)
print('Cancer$Pct_PersonalIsure')
summary(Cancer$Pct_PersonalIsure)
```
#### Data transformations
Based on the data evaluation mentiond before and additional analysis, we are transforming some of the valuables that have issues, as explained below.
```{r collapse=TRUE, comment=NA}
#creating separarte County and State columns to enable state-wide analysis
Cancer <- Cancer %>% separate(Geography, c("County", "State"), sep = ", ", remove = FALSE)
#replacing erroneous values with NA
Cancer$MedianAge[Cancer$MedianAge > 100] <- NA
Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize < 1] <- NA
# Define new binned income with equal sized bins
bins <- seq(20000, 130000, by = 10000)
Cancer$binnedInc2 <- cut(Cancer$medIncome, breaks = bins)
# Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
# Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
#creating a new variable to represent actual number of deaths due to cancer per county
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
#creating a new variable to represent the percentage of population with health insurance
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
```
# Univariate Analysis of Key Variables
Even though the presentation of this section takes a linear form, the actual analysis of key variables was an iterative process. The key variables were chosen based on:
* Our initial hypotheses regarding variables were potentially related to `deathRate`.
* The possibility that a variable/factor could be changed through interventions implemented by government health agencies to improve cancer prevention and survival.
* Additional analysis that we performed to identify variables that actually had a correlation with the dependent variable.
After selecting the key variables, our approach was to focus on assessing the quality of the data (as partly explained in the Introduction) and detecting features, through univariate analysis, that are important to include when modelling the relationships of interest, such as particular features in the distributions, unusual concentrations of observations around certain values, the presence of outliers and extreme outliers, among others.
### Death rate
Death rate's distribution is symmetric and bell-shaped, with a small amount of outliers at both sides of the mean (2.1% of outliers, with 0.03% of extreme outliers). However, these outliers are still within a reasonable range and do not seem to be errors in the data. Furthermore, the observation corresponding to the only extreme outlier does not look atypical based on the values of the other variables.
Finally, using both summary metrics and visualizations, we did not find any unusual concentration of observations around specific values.
```{r collapse=TRUE, comment=NA}
summary(Cancer$deathRate)
```
```{r}
boxHist(Cancer$deathRate, "Death rate (nummber of deaths per 100k people)")
```
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'deathRate')
```
```{r collapse=TRUE, comment=NA}
Cancer[Cancer$deathRate > 300, ]
```
### Incidence (DEATILS MIGHT BE HIDDEN TO SAVE SPACE)
Looking at the frequency of unique values in 'AvgAnnCount', we found that 206 observations contain the value 1962.667684. This is very likely an error because the values in this variable should all be integers, and in some cases this value is higher than the county population.
```{r incidence_freq, collapse=TRUE, comment=NA}
incidence_freq <- data.frame(table(Cancer$avgAnnCount))
incidence_freq[incidence_freq$Freq > 20, ]
```
```{r collapse=TRUE, comment=NA}
table(Cancer$avgAnnCount > Cancer$popEst2015)
```
Furthermore, these values are causing the incidence rate (that we will build to be able to compare death with incidence) to have extremely large values.
Incidence rate contains 188 extremely large values (higher than 1500 cases per 100,000 people). As can be seen below, all of these values are caused by the error in AvgAnnCount.
```{r collapse=TRUE, comment=NA}
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
table(Cancer$incidenceRate > 1500)
table(Cancer$incidenceRate[Cancer$avgAnnCount != 1962.667684] > 1500)
```
Therefore, we decided to remove these "1962.667684" values and replace them with NA.
```{r collapse=TRUE, comment=NA}
Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
#creating new variable to represent cancer incidence per 100K people per county
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
```
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'avgAnnCount')
```
### Incidence rate
The distribution of the incidence rate is unimodal and positively skewed, with 46 outliers and 1 extreme outlier. Since these values represent only 1.5% of observations and there is no furhter evidence that they are errors, we will keep them, but this should be taken into account when modelling the relationship between incidence and death rates.
```{r collapse=TRUE, comment=NA}
summary(Cancer$incidenceRate)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$incidenceRate, "Incidence rate (new cases per 100k people)")
```
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, 'incidenceRate')
```
### Median income
There are two income variables available: binned income and median income. From these two, we chose median income as our key variable because it is more granular than binned income and, second, because the width of the bins in binned income seems to have been defined such that there are the same number of observations in each bin, which is not useful to observe its distribution, and the cutoffs chosen make the charts hard to read.
```{r collapse=TRUE, comment=NA}
summary(Cancer$binnedInc)
```
Below, we can see that the median income is inded a good candidate, since it doesn't vary as much as income typically does (in this case, the difference between the minimum and maximum values is less than one order of magnitude), representing better the "average" member of each county. However, it's distribution is positively sekewed, with 64 counties that have median income greater than 80,000 USD.
```{r collapse=TRUE, comment=NA}
summary(Cancer$medIncome)
```
```{r collapse=TRUE, comment=NA}
sum(Cancer$medIncome > 80000)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$medIncome, "Median income")
```
Including the 64 observations above that contribute to the positive skewness of this variable, there are still 122 outliers (around 4% of the total observations) that need to be taken into account when building the statistical model that captures the relationship between this variable and the death rate.
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "medIncome")
```
Given the rather large number of outliers in this variable, once might consider log transformation. However, we decided to follow the rule provided by Fox (2011), where log transformation is only likely to make a difference if its values "cover two or more orders of magnitude" (Fox, p. 128).
### Education
To measure education, we have six possible candidates: 'PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24', 'PctHS25_Over' and 'PctBachDeg25_Over', that can be divided into two groups: 18-24 and '25 and above' years old. Our initial hypothesis was that the second group should have a stronger correlation with death rate. We validated this hypothesis with the correlations table shown below, that shows the only variable that has correlation with deathRate is `PctBachDeg18_24` (although this correlation is not strong, -0.31). Instead, as expected, the two '25 and above' education variables have a much greater correlation with `deathRate`.
Therefore, we will focus on these two variables for further analyses on education.
```{r collapse=TRUE, comment=NA}
cor(Cancer[, names(Cancer) %in%
c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24',
'PctHS25_Over', 'PctBachDeg25_Over', 'deathRate')], use = 'complete.obs')[7, ]
```
We also validated that education variables within each group are mutually exclusive, by making sure that they add up to 100% for all observations that have complete data. The total values range between 99.9 and 100.1, where the small variations around 100 are most probably due to rounding.
We can only test this with the 18-24 group since the 25_over group is missing two variables that capture 'no high school' and 'some college'. However, it is reasonable to assume that the same definition is applied to our group of interest (25_over).
```{r collapse=TRUE, comment=NA}
educ.18.24 <- c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24')
educ.df <- subset(Cancer, select = educ.18.24)
educ.complete <- complete.cases(educ.df)
sum.pct.freq <- data.frame(table(rowSums(educ.df[educ.complete, ], na.rm = TRUE)))
names(sum.pct.freq) <- c("Sum", "Frequency")
sum.pct.freq
```
#### PctHS25_over
Values in `PctHS25` are within a reasonable range (7 to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. Also, the disribution of this variable is unimodal and negatively skewed. However, it only contains 31 outliers (1% of observations) and there are no extreme outliers. Furthermore, there is no indication that these outliers are errors, so we decided to keep them.
```{r collapse=TRUE, comment=NA}
summary(Cancer$PctHS25_Over)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$PctHS25_Over, "Percentage age 25 or older with high school only")
```
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctHS25_Over")
```
#### PctBachDeg25_Over
Values in `PctHS25_Over` are within a reasonable range (7% to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. The disribution of this variable is unimodal and positively skewed. It contains 82 outliers (2.7% of observations) all of which at are at the right side of the mean. Of these 82 outliers, only 5 are extreme outliers, that we will keep in the dataset, since there are no indications that they are errors.
```{r collapse=TRUE, comment=NA}
summary(Cancer$PctBachDeg25_Over)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$PctBachDeg25_Over, "Percentage age 25 or older with bachelors degree only")
```
Extreme outliers
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctBachDeg25_Over")
```
### Poverty percent
The distribution of `povertyPercent` is unimodal and positively skewed. This is reflected by the fact that all outliers are at the right of the mean. Taking a deeper dive into the outliers, we found that only 3 are extreme while 66 are mild. For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.
However, when modeling the relationship of interest, we should take into account that the distribution of this variable is not normal and it might need transformation if the model used requires it.
```{r collapse=TRUE, comment=NA}
summary(Cancer$povertyPercent)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$povertyPercent, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "povertyPercent")
```
### Percentage employed (16 or over)
The distribution of `PctEmployed16_Over` is unimodal and negatively skewed. There are no extreme outliers and 20 mild outliers (0.7% of observations). For this reason, and because we did not find any indication that the outliers or other values were errors, we will keep all the values within this variable.
```{r collapse=TRUE, comment=NA}
summary(Cancer$PctEmployed16_Over)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$PctEmployed16_Over, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctEmployed16_Over")
```
### Percentage with public coverage
The distribution of `PctPublicCoverage` is unimodal and symmetric, with no extreme outliers and only 18 mild outliers (0.6% of observations). For this reason, and because we did not find any indication that the outliers or other values were errors, we will keep all data from this variable. There are also no other particular features from this variables that grant further warnings in modelling the relationship with `deathRate`.
```{r collapse=TRUE, comment=NA}
summary(Cancer$PctPublicCoverage)
```
```{r collapse=TRUE, comment=NA}
boxHist(Cancer$PctPublicCoverage, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r collapse=TRUE, comment=NA}
outliers.summ(Cancer, "PctPublicCoverage")
```
## Analysis of Key Relationships
### Education
As explained above, guided by our hypothesis that the education of the '25 and over' years old group should have a much stronger relationship with deathRate than the '18-24' years old group, which was supported by the correlations between these variables, we will focus on the former group.
#### PctHS25_over
A correlation of $0.4$ between `PctHS25_over` and `deathRate` indicates that there is indeed a relationship between these variables, which is further indicated by plotting them together in a scatterplot, that shows that higher values of percentage of population with only high school tend to be associated to higher death rates (this is also reflected in the regression line added to the scatterplot).
Such relationship is not unexpected, since it indicates that a higher concentration of people with low education levels may have poorer health habits and limited access to medical services. However, both of these variables could be affeccted by `MedianAge` in the same direction: older populations might have lower levels of higher education and higher rates of death.
```{r collapse=TRUE, comment=NA}
cor(Cancer$deathRate, Cancer$PctHS25_Over)
```
```{r collapse=TRUE, comment=NA}
plot(Cancer$PctHS25_Over, Cancer$deathRate, main = "HS (>24)")
abline(lm(Cancer$deathRate ~ Cancer$PctHS25_Over), lty = 'dashed', lwd = 2, col = 'red')
```
#### PctBachDeg25_0ver
A correlation of $-0.48$ indicates that there is relationship between `PctBachDeg25_over` and `deathRate`, which is further supported by plotting these variables in a scatterplot, where it can be seen that a higher percentage of people with bachelor's degree is associated with lower levels of death rates. This relationship is also supported by the regression line included in the scatterplot.
This is also not an unusual relationship, since higher levels of education might be linked to better health habits and access to health services. However, and following the same reasoning than `PctHS25_over`, the relationship between these two variables may be confounded by `MedianAge`, although it is not clear in which direction this effect might go. Therefore, it will also be necessary to explore the effect of `MedianAge` in the following section.
```{r collapse=TRUE, comment=NA}
cor(Cancer$deathRate, Cancer$PctBachDeg25_Over)
```
```{r collapse=TRUE, comment=NA}
plot(Cancer$PctBachDeg25_Over, Cancer$deathRate, main = "Bachelor (>24)")
abline(lm(Cancer$deathRate ~ Cancer$PctBachDeg25_Over), lty = 'dashed', lwd = 2, col = 'red')
```
## Analysis of Secondary Effects
Throughout the analyses above, we began to identify that some of the relationships found between `deathRate` and other variables may not only be capturing the direct relationship between these variables, but also those of additional variable(s) that may be impacting both. To further assess this systematically, the following network visualization shows the variables that have a correlation higher than 0.4, where each node represents a different variable and each vertex indicates the strength of the relationship between the variables connected.
#![secondary_analysis]
#(secondary_analysis.png)
### Age and family/householdd
`PercentMarried` has a (weak) relation and `AvgHouseholdSize` has a moderate relation with `MedianAge`. Based on these results, we explored this relationship further.
```{r collapse=TRUE, comment=NA}
cor(subset(Cancer,
select = c("MedianAge", "AvgHouseholdSize", "PercentMarried")),
use = "pairwise.complete.obs")[1, ]
```
Both the scatterplots below and the regression lines imposed on them provide further support that there is indeed a relationship between these two variabes and `MedianAge`, indicating that `MedianAge` may confound the relationship between these two and `deathRate`. Therefore, this should be taken into account when modelling the relationships of interest, in order to isolate the effect of the family variables on the death rate.
```{r collapse=TRUE, comment=NA}
plot(Cancer$MedianAge, Cancer$PercentMarried, main = "Age vs PercentMarried")
abline(lm(PercentMarried ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```
```{r collapse=TRUE, comment=NA}
plot(Cancer$MedianAge, Cancer$AvgHouseholdSize, main = "Age vs Average household size")
abline(lm(AvgHouseholdSize ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```
### Black population and employment
Correlation analysis shows that there is a relationship between the percentage of black population and employment, which is further confirmed both by a visual inspection of the scatterplot and the linear regression line charted in this plot. Since employment es related to `deathRate`, its correlation with `PctBlack` may indicate that this variable may be confounding the relationship of interest and thus further modeling needs to take this into account, to isolate the effect of unemployment on death rate.
```{r collapse=TRUE, comment=NA}
cor(subset(Cancer, select = c("PctBlack", "PctUnemployed16_Over")),
use = "pairwise.complete.obs")[, 1]
```
```{r collapse=TRUE, comment=NA}
plot(Cancer$PctBlack, Cancer$PctUnemployed16_Over, main = "Age vs PercentMarried")
abline(lm(PctUnemployed16_Over ~ PctBlack, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
```
### State
A boxplot containing different location measures of `deathRate` by `State` shows that these values vary measures vary significantly across state. Since `State` may be capturing several state-level characteristics that may in turn affect other variables that have a relation with `deathRate`, it is recommended to include state-level effects when modeling the relation of interest, to control for confounding these state-level factors.
```{r collapse=TRUE, comment=NA}
boxplot(Cancer$deathRate ~ Cancer$State)
```