forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer-Assessed Project 1"
author: "Christopher Jones"
#date: "`r format(Sys.time(), '%d %B, %Y')`"
date: "10/30/2017"
output:
  html_document:
    echo: true
    code_folding: hide
    keep_md: true
#runtime: shiny
---
## Activity Data Analysis {.tabset}
### Loading/preprocessing {.tabset}
Loading the data is straightforward. Light preprocessing then does the following:

* adds a row ID column
* converts the date column to the `Date` type
* adds a day-of-week column
* creates a dataset containing only the complete observations
```{r data_load}
# =======================
# Data load/preprocessing
# =======================
# download the data if it isn't already in the current directory
filename <- "activity.csv"
zipurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
if (!file.exists(filename)) {
  # mode = "wb" keeps the zip archive intact on Windows
  download.file(zipurl, destfile = "temp.zip", mode = "wb")
  unzip("temp.zip")
  unlink("temp.zip")
}
# pre-preprocessing
data.raw <- read.csv(filename, header = TRUE)
data.proc <- data.raw
data.proc$id <- seq_len(nrow(data.proc))                      # row ID
data.proc$date <- as.Date(data.proc$date)                     # text -> Date
data.proc$dow <- weekdays(data.proc$date, abbreviate = TRUE)  # "Mon", "Tue", ... (locale-dependent)
data.proc$dow <- as.factor(data.proc$dow)
data.proc.complete <- data.proc[!is.na(data.proc$steps), ]    # rows with a recorded step count
message("Data load/preprocessing complete.")
```
Further processing (aggregations etc.) is performed later, specific to the requirements of each topic.
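
As a small illustration of the preprocessing steps listed above, here is a sketch on a toy data frame. The values are made up; only the column names follow `activity.csv`. It also shows why `seq_len()` is a safer way to build row IDs than `seq(1:n)` when a data frame could be empty.

```r
# Toy stand-in for activity.csv (hypothetical values, same column names)
toy <- data.frame(
  steps    = c(NA, 12, 0, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

toy$id   <- seq_len(nrow(toy))   # row ID: 1, 2, 3, 4
toy$date <- as.Date(toy$date)    # ISO yyyy-mm-dd text parses without a format string

# Keep only the rows where a step count was recorded
toy.complete <- toy[!is.na(toy$steps), ]
print(toy.complete)

# Edge case: for zero rows, seq_len(0) is empty, but seq(1:0) is c(1, 2)
empty.ids.safe  <- seq_len(0)
empty.ids.risky <- seq(1:0)
```

The toy frame is only meant to make the four steps visible; the real chunk above works on the full dataset.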
### Mean steps per day {.tabset}
Instructions:
>What is mean total number of steps taken per day?
>
>For this part of the assignment, you can ignore the missing values in the dataset.
>
>Make a histogram of the total number of steps taken each day
>
>Calculate and report the mean and median total number of steps taken per day
The histogram below is built from the complete-observations dataset, per the instructions. The mean, standard deviation, and median of the daily totals are printed above the plot.
```{r steps_per_day}
# =================================================
# What is mean total number of steps taken per day?
# For this part of the assignment, you can ignore the missing values in the dataset.
# Calculate the total number of steps taken per day
# If you do not understand the difference between a histogram and a barplot, research the difference between them. Make a histogram of the total number of steps taken each day
# Calculate and report the mean and median of the total number of steps taken per day
# =================================================
# set up the data
stepsperday <- aggregate(data.proc.complete$steps, by=list(data.proc.complete$date), FUN=sum)
names(stepsperday) <- c("date", "steps")
# construct the raw plot
plot.new()
par(bg = "grey")
hist(stepsperday$steps, breaks=10, main="", xaxt = "n", xlab = "# Of Steps", col = "cornflowerblue")
axis(1, at=seq(0, 25000, 5000), cex.axis =.75)
title(main="Histogram Of Total Steps Per Day")
mtext(bquote(mu
             ~ "="
             ~ .(format(mean(stepsperday$steps), big.mark=","))
             ~ ", "
             ~ sigma
             ~ "="
             ~ .(format(sd(stepsperday$steps), big.mark = ","))
             ~ ", median ="
             ~ .(format(median(stepsperday$steps), big.mark = ","))
             )
      )
box()
```
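
One detail worth making explicit is what "ignoring missing values" does to the daily totals. The sketch below uses made-up numbers and the formula interface of `aggregate()` rather than the `by=` form used in the chunk above. It contrasts dropping incomplete rows first (a day with no recorded steps disappears) with summing under `na.rm = TRUE` (the same day is counted as zero steps).

```r
# Hypothetical data: 2012-10-02 has no recorded steps at all
toy <- data.frame(
  steps = c(10, 20, NA, 5, 15, 30),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02",
                    "2012-10-03", "2012-10-03", "2012-10-03"))
)

# Approach used in this report: drop incomplete rows, then total per day
toy.complete <- toy[!is.na(toy$steps), ]
per.day <- aggregate(steps ~ date, data = toy.complete, FUN = sum)

# Alternative: keep every day and sum with na.rm = TRUE (all-NA day becomes 0)
per.day.zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(per.day)
print(c(mean = mean(per.day$steps), median = median(per.day$steps)))
print(per.day.zero)
```

Treating an unrecorded day as zero steps would pull the mean and median down, which is why the complete-observations dataset is used here.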
### Daily activity pattern {.tabset}
Instructions:
>What is the average daily activity pattern?
>
>Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
>
>Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
In addition to answering the questions given, I've added a smoothing line (an exponential moving average). The average daily activity pattern is depicted below. On average, the maximum number of steps occurs at interval 835.
```{r daily_activity_pattern, fig.height=5.5, fig.width=6}
# ==========================================
# What is the average daily activity pattern?
# Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
# Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
# ==========================================
library(TTR) # for exponentially weighted moving average (higher weight to more recent observations)
# set up the data
avgstepsperinterval <- aggregate(data.proc.complete$steps, by=list(data.proc.complete$interval), FUN=mean)
names(avgstepsperinterval) <- c("interval", "avg_steps")
avgstepsperinterval$expMA <- EMA(avgstepsperinterval$avg_steps)
themean <- mean(avgstepsperinterval$avg_steps)
themax <- max(avgstepsperinterval$avg_steps)
themaxpre <- avgstepsperinterval$interval[avgstepsperinterval$avg_steps == themax]
thesd <- sd(avgstepsperinterval$avg_steps)
# construct the plot
# figure size comes from the chunk options; windows() exists only on Windows and opens a separate device
plot.new()
par(bg = "grey")
plot(avgstepsperinterval$interval, avgstepsperinterval$avg_steps, type="l", col="blue", xaxt = "n", xlab="Interval", ylab="Avg. Daily Steps")
axis(1, at=seq(0, 3500, 100), cex.axis =.75)
points(x=themaxpre, y=themax, pch=19, col="forestgreen")
text(x=1.05 * themaxpre
, y=.99 * themax
, labels=bquote("maximum: (interval, avg steps) = ("
~ .(themaxpre)
~ ", "
~ .(round(x=themax, digits=2))
~ ")"
)
, adj = c(0,0)
, cex=.75)
title(main="Average Steps Per Day, By Time Interval")
mtext("(with exponential moving average smoothing)")
# add exponential moving average line for some smoothing visualization
lines(x=avgstepsperinterval$interval, y=avgstepsperinterval$expMA, type="l", col="brown1")
rect(xleft=0, xright=650, ybottom=170, ytop=205)
legend(1, 210, legend=c("Data", "Exponential\nRunning Average"), col=c("blue", "brown1"), lty=c(1, 1), cex=0.7, bty="n")
```
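
To make the two ideas in this section concrete, here is a minimal sketch on made-up averages. It assumes that interval ids encode clock time as HHMM, so 835 would read as 08:35. The `simple.ema()` helper is hypothetical and hand-rolled for illustration; it is not the code inside `TTR::EMA`, whose defaults and seeding may differ.

```r
# Hypothetical per-interval averages (not values from activity.csv)
avg <- data.frame(
  interval  = c(825L, 830L, 835L, 840L, 845L),
  avg_steps = c(150, 180, 200, 190, 170)
)

# Row with the largest average; which.max() returns the first maximum only
peak <- avg[which.max(avg$avg_steps), ]

# Assumption: interval ids encode clock time as HHMM, so 835 means 08:35
peak.clock <- sprintf("%02d:%02d", peak$interval %/% 100L, peak$interval %% 100L)

# Illustrative exponential moving average:
# seeded with the mean of the first n points, then
# ema[i] = alpha * x[i] + (1 - alpha) * ema[i - 1], with alpha = 2 / (n + 1)
simple.ema <- function(x, n = 3) {
  alpha <- 2 / (n + 1)
  out <- rep(NA_real_, length(x))
  if (length(x) < n) return(out)
  out[n] <- mean(x[seq_len(n)])
  for (i in seq_len(length(x) - n) + n) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

avg$ema <- simple.ema(avg$avg_steps, n = 3)
print(peak.clock)
print(avg)
```

Because each smoothed value leans on the previous one, the smoothed line lags behind sharp peaks in the raw series, which is worth keeping in mind when reading the red line in the plot above.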
### Missing values {.tabset}
Instructions:
>Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
>
>Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
>
>Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated.
>
>For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
>
>Create a new dataset that is equal to the original dataset but with the missing data filled in.
>
>Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
For missing values, we use predictive mean matching (`pmm`) from the widely used mice package, generating 2 imputed data sets with 10 iterations each and keeping the first.
First, md.pattern shows which columns contain missing values and how many rows are affected:
```{r missing_values}
# use the standard package for imputing the missing steps values
library(lattice)
library(mice)
md.pattern(data.raw)
```
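
The same counts can be cross-checked in base R. This is a minimal sketch on a made-up data frame with the same column names, not output from the real dataset.

```r
# Hypothetical data with two missing step counts
toy <- data.frame(
  steps    = c(NA, 12, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

na.per.column  <- colSums(is.na(toy))        # missing values in each column
n.missing.rows <- sum(!complete.cases(toy))  # rows with at least one NA

print(na.per.column)
print(n.missing.rows)
```

On the real data, the same two calls applied to `data.raw` would give the total asked for in the instructions.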
Next, we perform the imputation and plot the raw and imputed histograms together:
```{r imputed_values}
# =======================
# Imputing missing values
# Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
# Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
# Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
# Create a new dataset that is equal to the original dataset but with the missing data filled in.
# Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day.
# Do these values differ from the estimates from the first part of the assignment?
# What is the impact of imputing missing data on the estimates of the total daily number of steps?
# ======================
# run the imputation only if a full-size imputed dataset isn't already in the workspace
if (!exists("data.proc.imputed") || nrow(data.proc.imputed) != 17568) {  # 17568 = expected row count
  tempData <- mice(data.raw, m = 2, maxit = 10, method = "pmm", seed = 500)
  data.proc.imputed <- complete(tempData, 1)  # keep the first of the 2 imputed datasets
  data.proc.imputed$id <- seq_len(nrow(data.proc.imputed))
  data.proc.imputed$date <- as.Date(data.proc.imputed$date)
  data.proc.imputed$dow <- weekdays(data.proc.imputed$date, abbreviate = TRUE)
  data.proc.imputed$dow <- as.factor(data.proc.imputed$dow)
}
# set up the imputed data
stepsperday.imputed <- aggregate(data.proc.imputed$steps, by=list(data.proc.imputed$date), FUN=sum)
names(stepsperday.imputed) <- c("date", "steps")
# verify we've at least nominally resolved the NA problem
# > nrow(data.proc.imputed[is.na(data.proc.imputed),])
# [1] 0
# construct the raw plot
plot.new()
col.raw <- "firebrick4"
col.imputed <- "dodgerblue1"
col.imputed.alpha <- .4
myrgb <- col2rgb(col.imputed)/256 * col.imputed.alpha + col2rgb(col.raw)/256 * (1 - col.imputed.alpha)
col.mixed <- rgb(myrgb[1], myrgb[2], myrgb[3], 1)
par(bg = "grey")
hist(stepsperday$steps, ylim=c(0, 20), breaks=10, main="", xaxt = "n", xlab = "# Of Steps", col = col.raw)
rug(stepsperday$steps, side=1, col=col.raw)
axis(1, at=seq(0, 25000, 5000), cex.axis =.75)
box()
# construct imputed plot
par(bg = "grey")
hist(stepsperday.imputed$steps, ylim=c(0, 20), breaks=10, main="", xaxt = "n", xlab = "# Of Steps", col = rgb(col2rgb(col.imputed)[1]/256, col2rgb(col.imputed)[2]/256, col2rgb(col.imputed)[3]/256, col.imputed.alpha), add=TRUE)
rug(stepsperday.imputed$steps, side=3, col=col.imputed)
axis(1, at=seq(0, 25000, 5000), cex.axis =.75)
title(main="Total Steps Per Day, Raw & Imputed", line=2.5)
data1 <- " Raw:"
data2 <- "Imputed:"
mytext1 <- list(data2, data1)
mtext(do.call(expression, mytext1), side=3, line=0:1, adj=c(.23, .23), cex=.75)
bq1 <- bquote(" " ~ mu
~ "="
~ .(format(round(mean(stepsperday$steps), digits=2), big.mark=","))
~ " "
~ sigma
~ "="
~ .(format(round(sd(stepsperday$steps), digits=2), big.mark = ","))
~ " median ="
~ .(format(median(stepsperday$steps), big.mark = ","))
)
bq2 <- bquote(" " ~ mu
~ "="
~ .(format(round(mean(stepsperday.imputed$steps), digits=2), big.mark=","))
~ " "
~ sigma
~ "="
~ .(format(round(sd(stepsperday.imputed$steps), digits=2), big.mark = ",", digits = 6))
~ " median ="
~ .(format(median(stepsperday.imputed$steps), big.mark = ","))
)
mytext <- list(bq2, bq1)
mtext(do.call(expression, mytext), side=3, line=0:1, adj=c(.6, .6), cex=.75)
legend(19000, 18, legend=c("Raw", "Imputed", "Both"), col=c(col.raw, col.imputed, col.mixed), cex=0.75, pch=c(22, 22, 22), pt.bg = c(col.raw, col.imputed, col.mixed))
box()
```
This plot has two parts: overlaid before/after histograms, and before/after rug plots (one tick per day) along the bottom (raw) and top (imputed) borders.
Overall, the imputation did not change the character of the data much, at least not visibly in the histogram. Most of the new weight was added slightly above the mean and median (which are close together), so the imputed mean and median ticked upward slightly. And because most of the added weight lies near the center of the distribution, the standard deviation decreased a little.
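
For comparison, the instructions also suggest a much simpler strategy: fill each missing value with the mean for its 5-minute interval. The sketch below shows that alternative on made-up numbers. It is not the predictive mean matching used above, and it illustrates the same effect noted there: values added near the center shrink the spread.

```r
# Hypothetical data: one missing value in each of two intervals
toy <- data.frame(
  steps    = c(10, NA, 30, 4, 6, NA),
  interval = c(0, 0, 0, 5, 5, 5)
)

# Mean of the observed steps within each interval, repeated for every row
interval.mean <- ave(toy$steps, toy$interval,
                     FUN = function(s) mean(s, na.rm = TRUE))

# Fill only the gaps; observed values are left untouched
toy$steps.filled <- ifelse(is.na(toy$steps), interval.mean, toy$steps)

sd.observed <- sd(toy$steps, na.rm = TRUE)
sd.filled   <- sd(toy$steps.filled)

print(toy)
print(c(observed = sd.observed, filled = sd.filled))
```

Mean filling is deterministic and easy to audit, while predictive mean matching draws from observed donor values and so keeps more of the natural variability.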
### Weekdays vs Weekends {.tabset}
Instructions:
>For this part the weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part.
>
>Create a new factor variable in the dataset with two levels -- "weekday" and "weekend" indicating whether a given date is a weekday or weekend day.
>
>Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
We break out weekdays and weekends in the plots below.
While both exhibit early spikes (around interval 800), the average weekend spike is markedly lower than the weekday spike. Additionally, weekend activity is spread much more evenly across the day than weekday activity. Finally, the weekends show much more activity in the initial 0-500 intervals than do the weekdays.
We hypothesize that these characteristics reflect the difference between work days and non-work days. More exploration would be needed to confirm this, starting perhaps with verifying that the differences noted are not merely an artifact of the imputed values.
```{r weekdays_weekends}
# ======================
# Are there differences in activity patterns between weekdays and weekends?
# For this part the weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part.
# Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
# Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.
# ======================
library(TTR) # for exponentially weighted moving average (higher weight to more recent observations)
# set up the data
weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri")
weekends <- c("Sat", "Sun")
# Weekdays data aggregation
avgstepsperinterval.wkdys <- aggregate(data.proc.imputed[data.proc.imputed$dow %in% weekdays,]$steps, by=list(data.proc.imputed[data.proc.imputed$dow %in% weekdays,]$interval), FUN=mean)
names(avgstepsperinterval.wkdys) <- c("interval", "avg_steps")
avgstepsperinterval.wkdys$expMA <- EMA(avgstepsperinterval.wkdys$avg_steps)
themean.wkdys <- mean(avgstepsperinterval.wkdys$avg_steps)
themax.wkdys <- max(avgstepsperinterval.wkdys$avg_steps)
themaxpre.wkdys <- avgstepsperinterval.wkdys$interval[avgstepsperinterval.wkdys$avg_steps == themax.wkdys]
thesd.wkdys <- sd(avgstepsperinterval.wkdys$avg_steps)
# Weekends data aggregation
avgstepsperinterval.wkends <- aggregate(data.proc.imputed[data.proc.imputed$dow %in% weekends,]$steps, by=list(data.proc.imputed[data.proc.imputed$dow %in% weekends,]$interval), FUN=mean)
names(avgstepsperinterval.wkends) <- c("interval", "avg_steps")
avgstepsperinterval.wkends$expMA <- EMA(avgstepsperinterval.wkends$avg_steps)
themean.wkends <- mean(avgstepsperinterval.wkends$avg_steps)
themax.wkends <- max(avgstepsperinterval.wkends$avg_steps)
themaxpre.wkends <- avgstepsperinterval.wkends$interval[avgstepsperinterval.wkends$avg_steps == themax.wkends]
thesd.wkends <- sd(avgstepsperinterval.wkends$avg_steps)
# construct the plots
plot.new()
layout(matrix(c(1,2), 2, 2, byrow=FALSE), widths=c(1,1), heights=c(4,4))
par(bg = "grey")
par(oma=c(3,3,3,3)) # all sides have 3 lines of space
par(mar=c(.1,4,3,2) + 0.1)
# Weekdays
plot(avgstepsperinterval.wkdys$interval, avgstepsperinterval.wkdys$avg_steps, type="l", ylim=c(0, 250), col="blue", yaxt="n", xaxt = "n", xlab="", ylab="Avg. Daily Steps")
axis(3, at=seq(0, 2400, 100), cex.axis =.60)
axis(2, at=seq(0, 250, 50), cex.axis =.75)
points(x=themaxpre.wkdys, y=themax.wkdys, pch=19, col="forestgreen")
text(x=1.05 * themaxpre.wkdys
, y=.99 * themax.wkdys
, labels=bquote("maximum: (interval, avg steps) = ("
~ .(themaxpre.wkdys)
~ ", "
~ .(round(x=themax.wkdys, digits=2))
~ ")"
)
, adj = c(0,0)
, cex=.75)
text(x=0, y=225, labels="Weekday", cex=1, adj=c(0,0))
y.tmp <- grconvertY(250, to='ndc')
# add exponential moving average line for some smoothing visualization
lines(x=avgstepsperinterval.wkdys$interval, y=avgstepsperinterval.wkdys$expMA, type="l", col="brown1")
# Weekends
plot(avgstepsperinterval.wkends$interval, avgstepsperinterval.wkends$avg_steps, type="l", ylim=c(0, 250), col="blue", yaxt="n", xaxt = "n", xlab="Interval", ylab="Avg. Daily Steps")
axis(1, at=seq(0, 2400, 100), cex.axis =.60)
axis(2, at=seq(0, 250, 50), cex.axis =.75)
points(x=themaxpre.wkends, y=themax.wkends, pch=19, col="forestgreen")
text(x=.75 * themaxpre.wkends
, y=1.05 * themax.wkends
, labels=bquote("maximum: (interval, avg steps) = ("
~ .(themaxpre.wkends)
~ ", "
~ .(round(x=themax.wkends, digits=2))
~ ")"
)
, adj = c(0,0)
, cex=.75)
text(x=0, y=225, labels="Weekend", cex=1, adj=c(0,0))
# add exponential moving average line for some smoothing visualization
lines(x=avgstepsperinterval.wkends$interval, y=avgstepsperinterval.wkends$expMA, type="l", col="brown1")
# sync lines between plots
par(xpd=NA)
numlines <- 11
plotwidth <- 2400
segments(seq(from = plotwidth/(numlines+1)
, to = plotwidth-(plotwidth/(numlines+1))
, by = plotwidth/(numlines+1))
, rep(-10, 4)
, seq(from = plotwidth/(numlines+1)
, to = plotwidth-(plotwidth/(numlines+1))
, by = plotwidth/(numlines+1))
, rep(1.0*grconvertY(y.tmp, from='ndc'), 4)
, lty='dashed'
, col='gray65')
# legend
par(xpd=TRUE)
legend(800, 325
, legend = c("Data", "Smoothed")
, col = c("blue", "brown1")
, lwd = 2
, cex = .75
, horiz = TRUE
)
par(xpd=NA)
# title
mtext("Weekday vs Weekend: Average Steps per Day, By Time Interval", outer=TRUE, cex=1)
```
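
The instructions ask for a two-level factor, whereas the chunk above filters on abbreviated day names, which depend on the session locale. A possible locale-independent sketch, on made-up data, derives the factor from the numeric day of week instead. The `daytype` column name is my own choice for illustration.

```r
# Hypothetical data: Friday, Saturday, Sunday, Monday
toy <- data.frame(
  steps    = c(10, 20, 40, 30),
  interval = c(0, 0, 0, 0),
  date     = as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
)

# POSIXlt stores the day of week as 0 (Sunday) through 6 (Saturday)
wday <- as.POSIXlt(toy$date)$wday
toy$daytype <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                      levels = c("weekday", "weekend"))

# Average steps per interval within each day type
avg.by.type <- aggregate(steps ~ interval + daytype, data = toy, FUN = mean)
print(avg.by.type)
```

With a factor like this in place, a single `aggregate()` call covers both panels, and the result does not change if the document is knitted under a non-English locale.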
fin