forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
139 lines (120 loc) · 5.54 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: "Reproducible Research: Peer Assessment 1"
output:
html_document:
keep_md: true
---
## Loading necessary packages
Two packages are needed - **tidyverse** to work with data and **lubridate** to work with dates. As the operating system is working in a local language, the R studio session needs to be set to English, in order to properly work with the names of the week.
```{r results="hide", warning=FALSE, message=FALSE,echo = TRUE}
library(tidyverse)
library(lubridate)
Sys.setlocale("LC_TIME", "C")
```
## Loading and preprocessing the data
Reading the data from .csv file and transforming to a tibble:
```{r, echo = TRUE}
activity_data <- read.csv("activity.csv")
activity <- as_tibble(activity_data)
```
Changing the format of the date from character to the date format:
```{r, echo = TRUE}
activity %>% mutate(date = ymd(date))
```
## What is mean total number of steps taken per day?
Calculating the total number of steps for each day:
```{r, echo = TRUE}
activity_sum <- activity %>% group_by(date) %>%
filter(!is.na(steps))%>%
summarise(sum_step = sum(steps))
```
### Plotting the the number of steps per day
Histogram of the total number of steps taken each day - based on the previously calculated sum of the steps per day:
```{r warning=FALSE, message=FALSE, echo = TRUE}
plot1 <- ggplot(activity_sum, aes(x = sum_step)) +
geom_histogram(color="darkblue", fill="lightblue") +
xlab("Number of steps") +
scale_y_continuous(breaks = seq(0, 8, by = 2)) +
ggtitle("Number of steps per day")
plot1
```
Calculation of the mean and median of the total number of steps taken per day:
```{r, echo = TRUE}
activity_summary <- activity_sum %>%
summarise(mean=mean(sum_step), median = median(sum_step))
print(activity_summary)
```
The mean and median are 10766 and 10765, respectively, so they are quite similar. This means that the data are evenly distributed.
## What is the average daily activity pattern?
Calculation and plotting the mean number of steps per interval across all dates:
```{r, echo = TRUE}
activity_int <- activity %>% group_by(interval) %>%
filter(!is.na(steps))%>%
summarise(avg = mean(steps))
plot2 <- ggplot(activity_int, aes(x = interval, y = avg)) +
geom_line(color="darkblue") +
ylab("Number of steps") +
ggtitle("Average number of steps per 5-min interval")
plot2
```
Finding the interval with the maximal number of steps:
```{r, echo = TRUE}
activity_max <- activity_int %>%
filter(avg == max(avg))
activity_max
```
The interval with the maximal mean number of steps accross all days is the interval 835 (with the mean of 206 steps).
## Imputing missing values
Finding the number of missing values in the original dataset:
```{r, echo = TRUE}
sum(is.na(activity))
```
There are 2304 missing values in the original dataset.
To fill the missing values the mean number of steps for the same interval calculated from the available data from other days was used. This method allows to get the missing values from the same time of the day so it should give the most reliable results.
```{r, echo = TRUE}
activity_noNA <- activity %>%
group_by(interval) %>%
mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = T), steps))
activity_noNA <- activity_noNA %>% mutate(date = ymd(date))
activity_sum_noNA <- activity_noNA %>% group_by(date) %>%
summarise(sum_step = sum(steps))
activity_summary_noNA <- activity_sum_noNA %>%
summarise(mean=mean(sum_step), median = median(sum_step))
print(activity_summary_noNA)
```
The mean and median after filling in the missing values are 10766 and 10766, respectively, so they are almost identical to the values previously calculated, when the NAs were ignored. This is the results of the chosen method of missing values calculation, based on the data from the same intervals from the days for which the data was available.
Plotting the the number of steps per day (after filling the missing data):
```{r warning=FALSE, message=FALSE, echo = TRUE}
plot3 <- ggplot(activity_sum_noNA, aes(x = sum_step)) +
geom_histogram(color="darkblue", fill="lightblue") +
xlab("Number of steps") +
scale_y_continuous(breaks = seq(0, 8, by = 2)) +
ggtitle("Number of steps per day")
plot3
```
## Are there differences in activity patterns between weekdays and weekends?
Defining the weekdays:
```{r, echo = TRUE}
weekdays <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
```
Classification of days into previously defined weekdays and the other (non-defined) days into weekends:
```{r, echo = TRUE}
activity_wk <- activity_noNA %>%
mutate(weekday = weekdays(date)) %>%
mutate(type = if_else(weekday == "Saturday" |
weekday == "Sunday", "weekend", "weekday"))
```
Calculating and plotting the mean number of steps per interval across all dates:
```{r warning=FALSE, message=FALSE, echo = TRUE}
activity_int_wk <- activity_wk %>% group_by(type, interval) %>%
summarise(avg = mean(steps))
```
```{r, echo = TRUE}
plot4 <- ggplot(activity_int_wk, aes(x = interval, y = avg)) +
geom_line(color="darkblue") +
ylab("Number of steps") +
ggtitle("Average number of steps per 5-min intervals") +
facet_grid(rows = vars(type))
plot4
```
The daily activity patterns between weekdays and weekends are somehow similar, but there are also several differences visible. Activity starts earlier during weekdays than on the weekends and during the weekdays there is a high peak of activity early in the day, which is not visible during the weekends.