---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Load packages
```{r }
library(chron)
library(ggplot2)
library(dplyr)
library(lubridate)
```
## Loading and preprocessing the data
```{r }
data<-read.csv("activity.csv", stringsAsFactors=FALSE)
data<-data %>% mutate(date=ymd(date))
```
## What is mean total number of steps taken per day?
```{r }
byDay<-data %>% group_by(date) %>% summarise(TotalSteps=sum(steps, na.rm = TRUE))
ggplot(data = byDay, aes(TotalSteps)) + geom_histogram(bins = 20)+ggtitle("Total steps per day frequency counts")
```
Mean of total steps by day
```{r }
mean(byDay$TotalSteps)
```
Median of total steps by day
```{r }
median(byDay$TotalSteps)
```
## What is the average daily activity pattern?
```{r }
byInterval<-data %>% group_by(interval) %>% summarise(MedianSteps=median(steps, na.rm=TRUE), MeanSteps=mean(steps, na.rm=TRUE))
ggplot(data = byInterval, aes(x=interval, y=MeanSteps)) + geom_line()
```
Interval with the maximum mean number of steps
```{r }
byInterval[which.max(byInterval$MeanSteps),1]
```
## Imputing missing values
Total number of missing values
```{r }
sum(is.na(data$steps))
```
Impute the missing values. We replace each missing value with the mean for that interval, averaged over all days.
```{r }
# Join the per-interval summary onto the raw data
inputed<-inner_join(x=data, y=byInterval, by="interval")
# Use the mean for that interval if the value is missing
inputed$extra <- ifelse(!is.na(inputed$steps), inputed$steps, inputed$MeanSteps)
ibyDay<-inputed %>% group_by(date) %>% summarise(TotalSteps=sum(extra, na.rm = TRUE))
```
Total number of steps per day
```{r }
ggplot(data = ibyDay, aes(TotalSteps)) + geom_histogram(bins = 20)+ggtitle("Total steps per day frequency counts")
```
Mean and median of total steps by day
```{r }
mean(ibyDay$TotalSteps)
median(ibyDay$TotalSteps)
```
The total number of steps per day has gone up, because `sum(steps, na.rm = TRUE)` previously counted the missing values as zero. The median is now equal to the mean because the missing values were filled with interval means, which pulls the centre of the distribution toward the mean.
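The effect described above can be reproduced on a small scale. The chunk below is a minimal sketch on made-up toy data, not the activity data set; the names `toy`, `totals_before`, `totals_after` and `interval_mean` are hypothetical. It assumes the missing values cover a whole day, which has not been checked here for the real data.

```{r }
# Hypothetical toy data: 3 days x 2 intervals; day "d2" is entirely missing.
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, NA, NA, 30, 40)
)

# Daily totals with missing values dropped: the all-NA day sums to zero.
totals_before <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Fill each missing value with the mean of its interval over the observed days.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
toy$filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

# Daily totals after filling: the missing day now gets the sum of interval means.
totals_after <- tapply(toy$filled, toy$date, sum)

totals_before
totals_after
c(mean = mean(totals_after), median = median(totals_after))
```

In the toy data the filled day totals 50, which is also the mean of the two observed days' totals (30 and 70), so the mean and the median of the filled totals coincide.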
## Are there differences in activity patterns between weekdays and weekends?
```{r }
inputed$isweekend<-as.factor(ifelse(is.weekend(inputed$date), "WeekEnd", "WeekDay"))
byIntervalw<-inputed %>% group_by(interval, isweekend) %>% summarise(MedianSteps=median(steps, na.rm=TRUE), MeanSteps=mean(steps, na.rm=TRUE))
# One line panel per day type; qplot() is deprecated and `type` is not a ggplot2 argument
ggplot(data = byIntervalw, aes(x = interval, y = MeanSteps)) +
  geom_line() +
  labs(x = "Interval", y = "Number of steps") +
  facet_wrap(~ isweekend, ncol = 1)
```