PA1.RMD

---
title: "Reproducible Research : Peer Assessment Project 1"
author: "Romain Faure"
date: "January 2016"
output: 
  html_document:
    highlight: tango
    keep_md: yes
    number_sections: yes
    toc: yes
    toc_depth: 1
---

# Introduction

This peer assessment project is part of the [Reproducible Research](https://www.coursera.org/learn/reproducible-research/home/welcome) 4-weeks online course offered by the [Johns Hopkins Bloomberg School of Public Health](http://www.jhsph.edu/) (Baltimore, USA) on [Coursera](https://www.coursera.org). [Reproducible Research](https://www.coursera.org/learn/reproducible-research/home/welcome) is the fifth course of the 11-months [JH Data Science Specialization](https://www.coursera.org/specialization/jhudatascience/1). 

As you may know, it is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These type of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized both because the raw data are hard to obtain and there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals through out the day. The assignment data consists of two months of data from an anonymous individual collected during the months of October and November 2012 and include the number of steps taken in 5-minute intervals each day.

# Downloading the raw data

The data for this assignment can be downloaded from the [course web site](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip), as I did with the script below :

```{r, echo=TRUE}
## REPRODUCIBLE RESEARCH - PROJECT 1
## Data downloading script

## check if a "data" directory exists in the current working directory
## and create it if not
if (!file.exists("data")) { 
    dir.create("data")
}

## download the file
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
download.file(fileUrl, "./data/activity.zip", method = "curl")

## unzip the file in the "data" sub-directory
unzip("./data/activity.zip", exdir = "./data/")

## delete the original zip file once it's been unzipped
unlink("./data/activity.zip")

## save and print the download time and date
downloadedDate <- date()
sprintf("File downloaded and unzipped on : %s", downloadedDate)
```

# Notes on the original raw data 

The downloaded file is a ZIP archive of 54 Ko. Once unzipped, the original raw data consists of one comma-separated-value (CSV) file of 351 Ko called "activity.csv".   
     
The original dataset is made of 17 568 observations :      
(Nb of measurements per hour * Nb of hours per day * Nb of days in October and November = Total Nb of measurements)
    
```{r}
(60 / 5) * 24 * (31 + 30)
```
    
The 3 variables included in this dataset are :

- ```steps``` *(integer)* : Number of steps taken in a 5-minute interval (missing values are coded as 𝙽𝙰)

- ```date``` *(factor)* : The date on which the measurement was taken in YYYY-MM-DD format

- ```interval``` *(integer)* : Identifier for the 5-minute interval in which the measurement was taken

# Loading and exploring the data

```{r, echo=TRUE}
activity1 <- read.csv(file = "./data/activity.csv")
```

```{r, echo=TRUE}
str(activity1)
```

As we can see above, ```steps``` and ```interval``` are integer variables, and ```date``` is a factor variable with 61 levels (31 days in October and 30 days in November).    
    
The raw dataset does not seeem to need preprocessing, apart from the NA issue in the ```steps``` variable (we'll take care of that later).

# Loading packages for the analysis

For this analysis we'll use the [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) and [Lattice](https://cran.r-project.org/web/packages/lattice/index.html) packages, so they both need to be installed and loaded before starting the actual analysis :

```{r, echo=TRUE}
library(dplyr)
library(lattice)
```

# Total, mean and median number of steps taken each day

As we've already seen, the ```steps``` variable contains NAs : for example no data is available for the 1st or the 8th of October 2012, among other days. We will ignore the missing values in the dataset for this part of the assignment, hence the use of the ```na.omit()``` function below when calculating the **total sum of steps per day**, and the size of the resulting grouped data frame (53 rows only instead of 61) :

```{r, echo=TRUE}
activity1grp1 <- activity1 %>% group_by(date) %>% na.omit() %>% summarise_each(funs(sum))
print.data.frame(activity1grp1[, 1:2])
```

Then we can make a **histogram** showing the distribution of the total number of steps taken each day. I chose to break the distribution into 10 bins instead of the default (5) to have a finer resolution, using the ```breaks = 10``` argument :

```{r, echo=TRUE}
hist(activity1grp1$steps, breaks = 10, xlab = "Total number of steps taken each day", 
     main = "Histogram of the total number of steps taken each day", col = "red", xlim = c(0, 25000))
```
   
We'll now calculate the **mean and median** number of steps taken each day :

```{r, echo=TRUE}
mean(activity1grp1$steps)
median(activity1grp1$steps)
```

# Time series plot of the daily average number of steps

We'll now make a time series plot (i.e. ```type = "l"```) of the 5-minute intervals (on the x-axis) and the number of steps taken averaged across all days (on the y-axis), so as to determine the average daily activity pattern of the individual :

```{r, echo=TRUE, warning=FALSE}
activity1grp2 <- activity1 %>% group_by(interval) %>% na.omit() %>% summarise_each(funs(mean))
plot(activity1grp2$interval, activity1grp2$steps, type = "l", xlab = "5-minute intervals", 
     ylab = "Nb of steps taken averaged across all days", main = "Average daily activity pattern", col = "blue")
```

We can see that the individual seems to get up around the 600th interval, which corresponds to 6am. Then we notice that on average the highest peak of activity in terms of steps seems to be located between the 800th and 900th intervals, i.e. between 8 and 9am - this could maybe correspond to the time the individual goes running everyday or walks to the office. Then the main part of activity seems to be located between the 1200th and 1900th intervals, i.e. between 12pm and 7pm - that is to say the afternoon, which seems logical.   
     
Let's now check our findings above regarding the highest peak of activity by calculating which 5-minute interval contains the maximum number of steps on average across all the days in the dataset :

```{r, echo=TRUE}
peakint <- activity1grp2$interval[which.max(activity1grp2$steps)]
sprintf("The 5-minute interval containing the maximum number of steps on average is the %sth", peakint)
```

As we could see in the time series plot above, the highest peak of activity is indeed located between 8 and 9am, more precisely at the 835th interval (8:35am).

# Imputing missing values

As we've seen earlier, there are a number of days/intervals where there are missing values (coded as NAs). The presence of missing days may introduce bias into some calculations or summaries of the data.   
    
We will start by calculating the total number of missing values in the dataset, per variable/column :

```{r, echo=TRUE}
sum(is.na(activity1$steps))
sum(is.na(activity1$date))
sum(is.na(activity1$interval))
```

So we can see that there are 2 304 missing values, all located in the ```steps``` column. This means we have 2 304 rows with NAs out of the 17 568 total rows of the dataset, which corresponds to 13.11% of missing values (2 304 * 100 / 17 568), that is to say more than 1 row out of 10.

Now that we have a clear picture of the NAs in our dataset, it's time to devise a strategy for filling in all of these missing values. We first create a copy of the original dataset that we're calling ```activity2``` and then decide to replace all the NAs with the mean (across all days) of the 5-minute interval for which data is missing, as this seems to be the best option to replace NAs with realistic values :

```{r, echo=TRUE}
activity2 <- activity1

for(i in 1:length(activity1$steps)) {
        if (is.na(activity1$steps[i]) == TRUE) {
                int <- activity1$interval[i]
                activity2$steps[i] <- ceiling(activity1grp2$steps[activity1grp2$interval == int])                
        }
}
```

Remember that the original ```activity1``` dataset did not contain any ```steps``` value for the 1st of October 2012 (among other days), as we can see below :

```{r, echo=TRUE}
head(activity1, 10)
```

We can now check that in our new ```activity2``` dataset, the first 10 rows, i.e. the first 10 intervals of the 1st of October 2012, no longer contain any NA :

```{r, echo=TRUE}
head(activity2, 10)
```

Indeed all the NAs have now been replaced by the the mean of the 5-minute interval for which data was missing (rounded to the nearest higher integer, using the ```ceiling``` function).     
Let's double-check by computing the number of NAs in our new ```activity2``` dataset :

```{r, echo=TRUE}
sum(is.na(activity2))
```


# Total, mean and median number of steps taken each day with missing values imputed

Now that the missing values have been imputed, we will make a new **histogram** of the total number of steps taken each day :

```{r, echo=TRUE}
activity2grp1 <- activity2 %>% group_by(date) %>% summarise_each(funs(sum))
hist(activity2grp1$steps, breaks = 10, xlab = "Total number of steps taken each day", 
     main = "Histogram of the total number of steps taken each day (NAs imputed)", col = "red", xlim = c(0, 25000))
```

This new histogram can look quite similar to the first one we made before imputing the NAs, but actually if we look closer we can see one noticeable difference : while all bins/intervals seem to have their frequency unchanged, the highest bin (located between 10 000 and 12 500 steps) was reinforced quite a lot as a result of our imputing strategy, which seems logical as we used mean values to replace the NAs.

We'll also re-calculate the **mean and median** number of steps taken each day, now that the NAs have been imputed :

```{r, echo=TRUE}
mean(activity2grp1$steps)
median(activity2grp1$steps)
```

Compared to the values computed before imputing the NAs, both the mean and the median have increased but in different proportions : the mean went from 10 766 to 10 785, while the median went from 10 765 to 10 909. So we can see that the median value saw the biggest change (+ 1.34%), while the mean value didn't change much (+ 0.18%). But as the percentages show, both changes remain quite small. Therefore the impact of imputing the missing values on mean and median seems to be limited, which makes sense as our imputing strategy reinforced the original central tendancy of the distribution.

# Comparing the average number of steps taken per interval across weekdays and weekends

We'll now make a panel plot to compare the average number of steps taken per 5-minute interval across weekdays and weekends. To do so, we first need to convert the ```date``` factor variable of our ```activity2``` dataset to the class "Date" using the ```as.Date``` function :

```{r, echo=TRUE}
activity2$date <- as.Date(activity2$date)
class(activity2$date)
```

Then we'll create a new factor variable called ```type``` in the ```activity2``` dataset with two levels, “weekday” and “weekend”, indicating whether a given date is a weekday or a weekend day. We can determine if a given date is a weekday or a weekend day using the ```weekdays``` function. The code below refers to the week days in their French abbreviated forms :

```{r, echo=TRUE}
activity2 <- activity2 %>% mutate(type = NA)
for(i in 1:length(activity2$date)) {
        day <- weekdays(activity2$date[i], abbreviate = TRUE)
        if (day %in% c("Lun", "Mar", "Mer", "Jeu", "Ven")) {
                activity2$type[i] <- "weekday"
        }
        else activity2$type[i] <- "weekend"
}
activity2$type <- factor(activity2$type)
```

Let's now look at the result of our ```for``` loop above :

```{r, echo=TRUE}
table(activity2$type)
```

We have 12 960 weekdays intervals and 4 608 weekend intervals. Let's check these numbers, keeping in mind that there were 45 weekdays and 16 weekend days in the 61 days of October-November 2012 :

```{r, echo=TRUE}
17568 * (45 / 61)
17568 * (16 / 61)
```

We can now finally make a panel plot containing 2 time series plots of the 5-minute intervals (on the x-axis) and the number of steps taken (on the y-axis) averaged across all weekdays (bottom panel) and weekend days (top panel)  :

```{r, echo=TRUE}
activity2grp2 <- activity2 %>% select(-date) %>% group_by(type, interval) %>% summarise_each(funs(mean))
xyplot(steps ~ interval | type, data = activity2grp2, type = "l", xlab = "5-minute intervals", 
       ylab = "Number of steps", main = "Average daily activity pattern on weekdays and weekend", 
       layout = c(1, 2))
```

Without surprise, we can notice a few differences between the weekdays and weekend activity patterns :  

- During the weekend, the activity seems to start later (around 8-9am instead of 6am during the week) and to continue a little bit later as well.
- We can also see that the activity is quite homogeneously spread between 9am and 9pm on weekend days, more homogeneously than on weekdays.
- There seems to be more steps/walking/movement during weekends than during weekdays, which is especially visible during the afternoon. Maybe because during weekends the individual does not have to sit all day at the office ? We would need to know more about the individual job to go further in our hypotheses.
- The 8-9am peak of activity during weekdays is a lot less obvious during weekends. Maybe because the individual does not go running or does not have to go to the office ?

# Additional information

- This analysis was done in RStudio 0.99.467 (R version 3.2.2 (2015-08-14) -- "Fire Safety") on Mac OSX 10.10.5.