generated from andreashandel/online-portfolio-template
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathdataanalysis_exercise.qmd
137 lines (105 loc) · 4.83 KB
/
dataanalysis_exercise.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Data Analysis Exercise - Botulism"
format: html
editor: visual
---
**Introduction**
This data is from the CDC website at https://data.cdc.gov/Foodborne-Waterborne-and-Related-Diseases/Botulism/66i6-hisz
and has counts of confirmed botulism cases in the United States by state, year, type of botulism toxin, and transmission type. I will be cleaning an working with all 5 variables.
```{r}
#load tidyverse
library(tidyverse)
#read in the data and assign to botulismdata
botulismdata<-read_csv("dataanalysis-exercise/rawdata/Botulism.csv")
#checking structure and summary of data set
str(botulismdata)
summary(botulismdata)
```
**Change to Factors**
BotType and ToxinType are both classified as characters but would be better represented as factors so I will change them.
```{r}
#change BotType and ToxinType to factor variables
botulismdata$BotType <- as.factor(botulismdata$BotType)
botulismdata$ToxinType <- as.factor(botulismdata$ToxinType)
#confirm they were both changed
str(botulismdata)
```
**Rename Columns**
Next, I think the BotType and ToxinType variables can be renamed to be a little clearer.
```{r}
#Rename the BotType and ToxinType variables
botulismdata <- rename(botulismdata, "Transmission Type" = "BotType")
botulismdata <- rename(botulismdata, "Toxin Type" = "ToxinType")
#Confirm the names changed correctly
str(botulismdata)
```
**Check for Missing Data**
```{r}
#Check is there is any missing data
colSums(is.na(botulismdata))
```
**Remove NAs**
All the NA entries appear to be in the state column. There are 2280 observations so I decided to just remove the 34 NA values since they don't make up a large percentage of the data overall.
```{r}
#Drop the NA values in the state column
botulismdata <- drop_na(botulismdata, State)
#Confirm number of observations dropped from 2280 to 2246
str(botulismdata)
```
```{r}
##install and load psych package to use describe to see a summary of data, also run summary
library(psych)
summary(botulismdata)
describe(botulismdata)
```
**Save as RDS file**
```{r}
saveRDS(botulismdata, file="cleanbotulismdata.RData")
```
## Abbie's Work
**1: Load in Data**
```{r}
AKbotulism<-readRDS("cleanbotulismdata.RData") #load() the dataset was shooting errors so I renamed it to read it in
head(AKbotulism)
```
I'm interested to see how the toxin types and counts varied by year.
**Grouped By State**
```{r}
library(ggplot2)
ggplot()+
geom_line(aes(x=Year, y=Count, group = State), data=AKbotulism, alpha = 0.2)+
facet_wrap(.~`Toxin Type`) +
theme_bw()
```
Well that was confusing, let's try it grouped by Transmission Type.
**Grouped by Transmission Type**
```{r}
ggplot()+
geom_line(aes(x=Year, y=Count, group = `Transmission Type`), data=AKbotulism, alpha = 0.2)+
facet_wrap(.~`Toxin Type`) +
theme_bw()
```
We can see some better patterns here if we squint really hard, but faceting by toxin type allows for too many options to get a good idea of what is happening within the data. Toxins A, B, E, and Unknown seem to be the most prevelant types, so I want to focus in on them
### Subset for specific toxins
```{r}
ABE<-AKbotulism%>%
filter(`Toxin Type` %in% c("A", "B", "E", "Unknown"))
```
Let's try that again:
**Grouped by Transmission Type**
```{r}
ggplot()+
geom_line(aes(x=Year, y=Count, group = `Transmission Type`), data=ABE, alpha = 0.35)+
facet_wrap(.~`Toxin Type`) +
theme_bw()
```
Awesome, now we can start to see the patterns and the ranges better among the data. It seems that Type A is popular and only continues to grow in prevalence. B seems to as well though on a smaller scale. I'm not really sure what's going on with E, and the number of unknown seems to be decreasing. This may be because these cases are becoming better identified and may account for some of the increase in A and B types (all right around 1975).
While I grouped by Transmission Type to better see the data, that grouping isn't telling us much at the moments, so let's see if there's any patterns within it.
```{r}
ggplot()+
geom_point(aes(x=Year, y=Count, color = `Transmission Type`), data=ABE, alpha = 0.35)+
facet_wrap(.~`Toxin Type`) +
theme_bw()
```
This is pretty cool! Most of the new cases for A and B are transmitted by infants with foodborne transmission nearly dying out overnight. The number of unknown types of foodborne transmission also snuffed out around the same time - I wonder if there was new food safety legislation in place that would explain the mass decrease. However, Type E remains mainly foodborne and pretty constant across time. Also with the drastic increase in infant transmission around when foodborne transmission ended I wonder if the classifications for transmission were changed. There's no instance of "infant" transmission until 1975.
I don't know enough about botulism to provide more commentary, but this posed some interesting questions I'll keep in mind if this topic arises again.