forked from rdpeng/RepData_PeerAssessment1
# Analysis and process
## Data and packages
Download the raw [activity monitoring data](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip), unzip the archive, and place `activity.csv` in your working directory.
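The download step can also be scripted. The sketch below is one possible approach, not part of the original analysis: `fetch_activity()` and the local file name `activity.zip` are hypothetical, and it assumes the archive contains `activity.csv` at its top level.

```r
# Hypothetical helper: download and unpack the data only if the CSV is missing.
fetch_activity <- function(url,
                           zip_path = "activity.zip",   # assumed local name
                           csv_path = "activity.csv") { # file read later on
  if (!file.exists(csv_path)) {
    if (!file.exists(zip_path)) {
      # mode = "wb" keeps the zip archive intact on Windows
      download.file(url, destfile = zip_path, mode = "wb")
    }
    # Extracts into the working directory (assumes activity.csv is at top level)
    unzip(zip_path)
  }
  invisible(csv_path)
}

# Usage (needs network access the first time):
# fetch_activity("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip")
```

Checking for the file first means the report can be knitted repeatedly without downloading the archive again.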
Load the following packages. If any of them are not installed, install them first with `install.packages()`.
```{r, results= 'hide', message=FALSE}
library('lubridate')
library('ggplot2')
library('dplyr')
library('knitr')
library('markdown')
```
Load the data with the following code and convert the `date` column to a date-time class (`POSIXct`).
```{r, results='hide'}
data <- read.csv('activity.csv')
data$date <- as.POSIXct(data$date)
```
## Steps per day analysis
Let's see how the number of steps varies from day to day, using a bar chart of the total steps recorded on each date.
```{r plot1, warning=FALSE}
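# geom_col() stacks the interval counts for each date into one bar per day;
# it is the idiomatic equivalent of geom_histogram(stat = 'identity')
plot1 <- ggplot(data, aes(x = date, y = steps)) +
  geom_col() +
  labs(title = "Steps Per Day", x = 'Date', y = 'Number of Steps') +
  scale_x_datetime()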
plot1
```
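The plot above draws one bar per date. A histogram in the strict sense would instead bin the daily totals and count how many days fall into each bin. The sketch below illustrates that idea in base R; the totals are made-up values for illustration only, not results from the activity data, and the bin width of 4000 steps is an arbitrary assumption.

```r
# Made-up daily totals (hypothetical values, not taken from activity.csv)
daily_totals <- c(126, 11352, 12116, 13294, 15420, 11015, 12811, 9900)

# Bin the totals into 4000-step ranges without drawing the plot
bins <- hist(daily_totals, breaks = seq(0, 16000, by = 4000), plot = FALSE)

# Number of days in each range: (0,4000], (4000,8000], (8000,12000], (12000,16000]
bins$counts
```

Dropping `plot = FALSE` draws the histogram itself.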
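To calculate the mean and median steps per day, first total the steps for each date. A day that contains missing values gets a total of `NA` and is excluded from the mean and median below by `na.rm = TRUE`.
```{r, results='hide'}
tot.steps.day <- data %>%
  group_by(date) %>%
  summarise(tot.steps = sum(steps))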
```
```{r, mean}
mean.steps.day <- mean(tot.steps.day$tot.steps, na.rm = TRUE)
median.steps.day <- median(tot.steps.day$tot.steps, na.rm = TRUE)
```
The **mean** steps per day is **`r as.character(round(mean.steps.day,1))`**.
The **median** steps per day is **`r as.character(median.steps.day)`**.
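To make the treatment of missing values concrete, here is a small base R sketch on a toy data set (the values are invented for illustration). It shows that `sum()` returns `NA` for a day with missing readings, and that `na.rm = TRUE` then leaves that day out of the mean and median.

```r
# Toy data: three days, two readings each; the first day is entirely missing
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  steps = c(NA, NA, 10, 20, 30, 50)
)

# Total per day; any NA within a day makes that day's total NA
daily <- tapply(toy$steps, toy$date, sum)

# The NA day is dropped, so both statistics use only the two complete days
mean_daily   <- mean(daily, na.rm = TRUE)    # (30 + 80) / 2
median_daily <- median(daily, na.rm = TRUE)
```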
## Average daily pattern analysis
To plot the average number of steps taken in each interval, averaged across all days, the data must first be grouped by interval.
```{r, results = 'hide'}
int.mean <- data %>%
  group_by(interval) %>%
  summarise(mean.steps = mean(steps, na.rm = TRUE))
```
The data is now ready to be plotted with ggplot.
```{r plot2}
plot2 <- ggplot(int.mean, aes(x = interval, y = mean.steps)) +
  theme(legend.position = 'none') +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(aes(color = mean.steps), linewidth = 1) +
  # horizontal reference line at the mean across all intervals
  geom_hline(yintercept = mean(int.mean$mean.steps)) +
  annotate('text', x = 4, y = 45, label = 'mean') +
  labs(title = 'Mean Steps Per Interval')
plot2
```
```{r, results = 'hide'}
max.int.steps <- filter(int.mean, mean.steps == max(mean.steps))
```
The interval with the highest average number of steps across all days is **`r max.int.steps$interval`**, with **`r round(max.int.steps$mean.steps,0)`** steps.
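The same two steps, averaging by interval and then picking the peak, can be sketched in base R on a toy data set. The values below are invented for illustration, and `tapply()` with `which.max()` stands in for the dplyr pipeline used above.

```r
# Toy data: three intervals observed on two days, with one missing reading
toy <- data.frame(
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(0, 12, NA, 4, 20, 6)
)

# Mean steps per interval, ignoring missing readings
int_mean <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Label of the interval with the highest mean
peak <- names(int_mean)[which.max(int_mean)]
```

Note that `which.max()` returns only the first maximum, whereas filtering on `mean.steps == max(mean.steps)` keeps every tied interval.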
## Dealing with missing values and imputation
```{r, results='hide'}
inc.cases <- sum(!complete.cases(data))
```
There are **`r inc.cases` rows with missing values** (coded as `NA`) in the data set. To see how the missing values affect the analysis, we impute them and compare the results with those from the original data.
The code below joins each row to the mean for its interval, computed across the whole data set, and replaces every missing `steps` value with that interval mean.
```{r, results='hide'}
data3 <- data
data3 <- left_join(data3, int.mean, by = 'interval')
data3$steps[which(is.na(data3$steps))] <- data3$mean.steps[is.na(data3$steps)]
```
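As an illustration of the imputation rule, the sketch below applies the same idea to a toy data set in base R. The values are invented, and a named-vector lookup replaces the `left_join()` used above; both fill each `NA` with the mean of its interval.

```r
# Toy data: two intervals on three days; the first day is missing
toy <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 30, 20, 50)
)

# Interval means from the observed values only
int_mean <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Replace each NA with the mean of the interval it belongs to
imputed <- toy
missing <- is.na(imputed$steps)
imputed$steps[missing] <- int_mean[as.character(imputed$interval[missing])]
```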
After imputation, the new complete data set is ready to be plotted.
```{r plot 4}
# geom_col() stacks the interval counts for each date into one bar per day
plot4 <- ggplot(data3, aes(x = date, y = steps)) +
  geom_col() +
  labs(title = "Steps Per Day", x = 'Date', y = 'Number of Steps') +
  scale_x_datetime()
plot4
```
Now, let's compare the mean and median of the incomplete data set with those of the new data set with imputed values.
```{r}
tot.steps.day.data3 <- data3 %>%
  group_by(date) %>%
  summarise(tot.steps = sum(steps))
imp.mean <- mean(tot.steps.day.data3$tot.steps)
imp.median <- median(tot.steps.day.data3$tot.steps)
data.steps.sum <- sum(data$steps, na.rm = TRUE)
data3.steps.sum <- sum(data3$steps)
```
The new **mean** is **`r as.character(round(imp.mean,1))`**.
The new **median** is **`r as.character(round(imp.median,1))`**.
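The new mean and median differ only slightly from those of the original data set. One likely reason is the imputation method. Each missing value is replaced by its interval mean, so if the missing values fall on whole days, each of those days receives a total equal to the original mean of the daily totals. That leaves the mean unchanged and moves the median toward it.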
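The effect of interval-mean imputation on the daily mean can be checked on a toy data set. The sketch below uses invented values and assumes that one whole day is missing. The imputed day then gets a total equal to the mean of the observed daily totals, so the overall mean does not move.

```r
# Toy data: three days, two intervals each; the first day is entirely missing
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 30, 20, 50)
)

# Daily totals before imputation (the first day is NA)
before <- tapply(toy$steps, toy$date, sum)

# Impute each NA with its interval mean
int_mean <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)
filled <- toy
gap <- is.na(filled$steps)
filled$steps[gap] <- int_mean[as.character(filled$interval[gap])]

# Daily totals after imputation (the first day is now 15 + 40)
after <- tapply(filled$steps, filled$date, sum)

mean_before <- mean(before, na.rm = TRUE)
mean_after  <- mean(after)
```

With partially missing days the two means would generally differ, so this equality is specific to the whole-day assumption.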
## Weekend vs weekday activity analysis
To analyze the difference in activity between weekdays and weekends, the data must be transformed once again. A new factor variable is added with two levels: 'weekend' and 'weekday'.
```{r, results='hide'}
data4 <- data3
# lubridate's wday() numbers Sunday as 1 by default, so wday > 5 would flag
# Friday and Saturday. week_start = 1 numbers Monday as 1, which makes
# Saturday 6 and Sunday 7.
data4$wday <- wday(data4$date, week_start = 1)
data4$week.cat <- as.factor(ifelse(data4$wday > 5, 'weekend', 'weekday'))
int.data4 <- data4 %>%
  group_by(week.cat, interval) %>%
  summarise(mean.steps = mean(steps))
```
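Weekday numbering conventions differ between functions, so it is worth checking the classification on a few known dates. The sketch below is a base R cross-check that does not depend on lubridate. It relies on `as.POSIXlt()`, whose `wday` component runs from 0 (Sunday) to 6 (Saturday), and on the fact that 1 October 2012 was a Monday.

```r
# Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Day of week in base R: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
dow <- as.POSIXlt(dates)$wday

# Saturday (6) and Sunday (0) form the weekend
week_cat <- factor(ifelse(dow %in% c(0, 6), "weekend", "weekday"))

data.frame(dates, dow, week_cat)
```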
Now we can plot the differences between weekend and weekday activity.
```{r plot4}
plot5 <- ggplot(int.data4, aes(x = interval, y = mean.steps, group = week.cat)) +
  geom_line() + facet_wrap(~week.cat, ncol = 1) +
  labs(title = 'Weekday vs weekend: mean steps per interval', x = 'Interval', y = 'Mean steps')
plot5
```
This visualization shows that weekday activity is higher in the early part of the day, while weekend activity appears slightly higher across the day as a whole.