generated from andreashandel/online-portfolio-template
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathdataanalysis-exercise.qmd
201 lines (128 loc) · 6.86 KB
/
dataanalysis-exercise.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
title: "Data Analysis Exercise - Smoking and Tobacco Use"
format: html
editor: visual
---
# About the Dataset
This data set is the Behavioral Risk Factor Surveillance System (CDC BRFSS) Smoking and Tobacco Use from 1995 to 2010. It includes the rate by which each state (and US territory) exhibits different smoking habits. The rates are weighted to population characteristics to allow for comparison of different population sizes. It includes variables such as **"Smokes everyday"** **"Former Smoker"** **"Smokes some days"** and **"Never Smoked"**. I chose it because it was complete and relatively clean, as well as spanned over a large amount of years. This allowed for more than 53 rows which I saw in multiple other data sets. I wanted to use something a bit larger than only having one entry per state.
This data has been used to create [data visualizations also published on the CDC website](https://data.cdc.gov/Smoking-Tobacco-Use/2010-BRFSS-Smokes-Everyday-Map/exwk-rjkz) if you're interested in looking more into this data.
# Processing the Data
## Reading in the Data
```{r include = FALSE}
library(tidyverse)
library(readr)
library(ggplot2)
library(knitr)
library(here)
```
**Tobacco Use - Smoking Data for 1995-2010**
*A Note* I commented out the code chunk to only display the data table.
```{r echo=FALSE}
tobacco<-read_csv("data/BRFSS_Tobacco_Use_Smoking_Data_1995_2010.csv")
head(tobacco, n=10)
```
We can see that there are two potential location variables, one which also includes latitude and longitude. I want to see if these locations are different, and separate the coordinates from the location. I'm also not sure what the coordinates mean -- they may be the midpoint of the state, the capital, the data collection center, etc.
A quick Google search for the first set of coordinates in Oregon showed that they point to a potential state midpoint. This is confirmed by also looking at those for Indiana (my hometown!) and being placed in the heart of downtown Indianapolis in the center of the state and those for Georgia and being placed in Macon.
We can also see 4 entries which are only coordinates for 2009-2010 Guam and Virgin Islands, so we need to make sure these get moved appropriately to the correct collumns when they are not structured the same way as the rest of the data.
## Cleaning the Data
```{r}
#Separate location from coordinates
tobacco_clean1<-separate(tobacco, col = `Location 1`, into = c("Location", "LatLong"), sep = "\n")
```
There are a couple warnings where there is no data for this field. I'm not super worried about those, I more just want to investigate and standardize the data that exists.
```{r}
#account for data with different structures
tobacco_clean1$LatLong<-ifelse(tobacco_clean1$Year %in% c(2009, 2010) &
tobacco_clean1$State %in% c("Virgin Islands", "Guam"),
tobacco_clean1$Location,
tobacco_clean1$LatLong)
tobacco_clean1$Location<-ifelse(tobacco_clean1$Year %in% c(2009, 2010) &
tobacco_clean1$State %in% c("Virgin Islands", "Guam"),
NA,
tobacco_clean1$Location)
#separate from each other coordinates
tobacco_clean2<-separate(tobacco_clean1, col = LatLong, into = c("Lat", "Long"), sep = ",")
tobacco_clean2$Long<-str_sub(tobacco_clean2$Long, 1, str_length(tobacco_clean2$Long)-1)
tobacco_clean2$Lat<-str_sub(tobacco_clean2$Lat, 2, -2)
#look at similarities/differences in Location and State
different_locs<-tobacco_clean2%>%
filter(State != Location)
different_locs #none!
```
Great! Since there are no unusual or different inputs we can remove one of the duplicate columns
```{r}
tobacco_clean_F<-tobacco_clean2%>%
select(-Location)
```
## Wide to Tall
Next, since each rate is its own column, this may make it difficult to analyze and compare, so I want to change it to wide-to-tall format.
```{r}
tobacco_tall<-gather(tobacco_clean_F, SmokeAmount, Rate, `Smoke everyday`:`Never smoked`)
```
## Data Table for Tobacco Use - Smoking Data for 1995-2010
```{r echo=FALSE}
#The lat and long will look super long and clunky in a data table, so I want to round them to be pretty for presentation
tobacco_tall$Lat<-as.numeric(tobacco_tall$Lat)
tobacco_tall$Lat<-round(tobacco_tall$Lat, 4)
tobacco_tall$Long<-as.numeric(tobacco_tall$Long)
tobacco_tall$Long<-round(tobacco_tall$Long, 4)
#Data Table
rmarkdown::paged_table(tobacco_tall)
```
```{r, include=FALSE}
data_location <- here::here("data","AK_tobacco.RData")
#Export RDS File
#save(tobacco_tall, tobacco, file = data_location)
load(data_location) #reload as needed
```
This will be our dataset for analysis! It allows us to group by the "SmokeAmount" variable while having a consistent variable of interest "Rate" among all categories.
# Data Analysis
Since this data covers from 1995 to 2010, I want to first look at these variables over time. Below are plots for each category across all years.
```{r}
ggplot()+
geom_line(aes(x=Year, y=Rate, group=State), data=tobacco_tall, alpha = 0.2, color = "blue")+
facet_wrap(.~SmokeAmount)+
theme_bw()
```
This is a bit muddy of a plot, but we can see the general trends in rates between each category. It seems like among most states it is most common to not smoke. Former smoker and Smoking Everyday seem pretty comparable with smoking everyday on the decline. Finally smoking some days is the least common, but there does seem to be a very slight increase since 1995.
This is where my work ends, best of luck to Player 2.
## Kelly Hatfield's Section
### Step 1: Viewing the Tobacco R data
```{r}
summary(tobacco_tall)
```
### Step 2: See how many years and states are represented.
```{r}
table(tobacco_tall$SmokeAmount)
tobacco_tall_NS = subset(tobacco_tall, SmokeAmount == "Never smoked")
tobacco_tall_NS_2000=subset(tobacco_tall_NS, Year == 2000)
table(tobacco_tall_NS$State)
table(tobacco_tall_NS$Year)
```
### Step 3: Create a table of average rate of Never Smokers by year
```{r}
results <- aggregate(tobacco_tall_NS$Rate, list(tobacco_tall_NS$Year), FUN=mean)
results <- tobacco_tall_NS %>%
group_by(Year)%>%
summarise_at(vars(Rate), list('Average'=mean))
#
library(knitr)
knitr::kable(results, caption = "Percent of Non-Smokers by State", digits=2)
#Change Never smoked variable name
#
```
### Step 4: Create box plots and spaghetti plots showing rates of % Never smoked for states by year
```{r}
#Change Never smoked variable name
ggplot(tobacco_tall_NS, aes(x=factor(Year), y=Rate)) + geom_boxplot() + ylim(0,100)
ggplot(tobacco_tall_NS, aes(x=(Year), y=Rate, group=State)) + geom_line() + ylim(0,100)
#
```
### Step 5: Print Top 5 States with highest percentage of never smokers in 2010
```{r}
#
library(dplyr)
sorted_data <- subset(tobacco_tall_NS_2000,select=c(State,Year,Rate))
sorted_data2 <- top_n(sorted_data,5,Rate)
print(sorted_data2)
```