wrangling.Rmd

---
title: "Data wrangling using dplyr and tidyr"
---

> Data wrangling: loosely, the process of manually converting data from one
"raw" form into another format that allows for more ease of eventual analysis
and visualization (adapted from [Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling))

R was developed by statisticians to do statistical work.  As such, embedded 
within R are capabilities to easily wrangle and manage data, to have data in a 
format that can be used for further analysis, and to work with datasets called
dataframes. There are also excellent packages available to make data wrangling
much easier in R. These packages are `dplyr` and `tidyr`. A minor assumption
about using these packages: the data you are importing from the start is fairly
clean (i.e. no large amount of missing values, no data entry errors or fixes
needed, etc).

# Learning objectives:

- Appreciate that data wrangling and managing are important and time-consuming
aspects to research
- Learn the 'grammar' of shaping and moving data from one form to another
- Understand and apply basic statistics to quickly summarize your dataset

# Let's get wrangling, the basics

## Wrangling your data, `dplyr` style

Data wrangling can be a bit tedious in base R (R without packages), so we'll be 
using two packages designed to make this easier. `dplyr` comes with a `%>%` pipe 
function (via the `magrittr` package), which works similar to how the Bash shell
`|` pipe works (for those familiar with Bash, ie. those who use Mac or Linux).
The command on the right-hand side takes the output from the command on the
left-hand side, just like how a plumbing pipe works for water.  `tbl_df` makes
the object into a `tbl` class, making printing of the output nicer. The other
nice thing about `dplyr` is that it can connect to SQL and other type of
databases and is very fast at wrangling data, unlike base R. Check out the
[resources page](../resources/) for links to more about this.

```{r wrangle, message=FALSE}
library(dplyr)
library(tidyr)
library(readr)

# Import the dataset
ds <- read_csv("http://codeasmanuscript.org/states_data.csv")

## Compare
head(ds)
## With:
tbl_df(ds)

## Now put the tbl dataset into a new object
ds2 <- tbl_df(ds)
ds2
```

## Select columns

Often times, you want to select only some of the columns or variables from a
dataset.  For that we use the `select` command, which does as it says.  Note the
use of the `%>%` operator.  This allows you to chain commands together, letting
you do more with only a few commands.

```{r wrangleSelect}
ds2 %>% 
    select(Population, Income, Area)
```

The real power with using the `select()` function comes when you combine it with
[regular expressions (regexp)](http://www.regular-expressions.info/), or rather pattern
searching. `dplyr` has several pattern searching functions, including
`starts_with()`, `contains()`, and the most powerful `matches()`. The function
`matches()` uses regexp, which are special commands that use
certain, unique syntax for searching for patterns. For example, `^string` means
that `string` is the first character, `string$` means that `string` is last
character, `string|strung` searches for either `string` or `strung`, etc.
regexp syntax are nearly a language to themselves, so use
[StackOverflow](http://stackoverflow.com/) and Google as much as you can! 

Ok, so lets say you want to search for variables that have certain patterns:

```{r wrangleSelectRegExp}
ds2 %>% 
    select(contains('Pop'), starts_with('Fr'))

## Or more simplified
ds2 %>% 
    select(matches('Pop|Fr'))
```

You can see that if you have many variables that have a common structure to
their name, you can quickly select all those variables by using functions such
as `matches()`.

## Rename columns

You can rename columns using the `rename` command (the new name is on the left
hand side, so `newname = oldname`).

```{r wrangleRename}
ds2 %>% 
    rename(HighSchoolGrad = HSGrad)
```

## Filter rows

Another common task in data wrangling is subsetting your dataset.  You can
subset the dataset using `filter`.  Note the double equal sign `==` for testing
if 'Examination' is equal to 15.  A single `=` is used for something else
(assigning things to objects or using them in functions/commands).

```{r wrangleFilter}
## For continuous/number data
ds2 %>% 
    select(Population, Illiteracy, Income, Region) %>% 
    filter(Illiteracy < 2, Population == 365)

## Or for 'string' (words or letters) data
ds2 %>%
    select(Population, Illiteracy, Income, Region) %>% 
    filter(Region == 'Northeast')
```

## Create new columns or clean up existing ones

If you want to create a new column, you use the `mutate` command.  The
`ifelse()` command lets you use a condition to have different values depending
on the condition.

```{r wrangleMutate}
ds2 %>%
    mutate(Testing = 'yes',
           Rich = ifelse(Income > 5000, 'Yes', 'No')) %>% 
    select(StateName, Population, Income, Rich, Testing)
```

However, it's fairly common that you need to do some data janitorial work by
cleaning up an existing column. For example, in a dataset with a 'Sex' variable,
some values had data entry errors in spelling, such as 'fmale' when it should be
'female'. This needs to be fixed and can be done fairly easily in R. So let's
'pretend that all words starting with 'G' in the `X` (county) column should
'actually be 'J' and that all words with an 'e' at the end should be removed.
'For this, we will use the `gsub()` command within `mutate()`, which will
'*g*lobablly *sub*stitute a pattern with the replacement.

```{r wrangleMutateGsub}
ds2 %>% 
    select(Region) %>% 
    mutate(
        Region = gsub(pattern = '^S', replacement = 'J', Region),
        Region = gsub(pattern = 't$', replacement = '', Region)
        )
```

Notice the `^` and `$` characters. Those are special syntax symbols used in
regexp commands. We introduced them above, but we'll quickly go over
it again. These special symbols perform certain functions. In this case `^G`
means for all "G" that are at the start of the string/character, while `e$`
means for all "e" that are at the end of a string. Or let"s say that all "mont",
"mout", and "mnt" should actually be "ment". We can do some cleaning fairly easily
here.

```{r wrangleMutateGsub2}
ds2 %>% 
    mutate(Region = gsub('Northeast', 'NE', Region, ignore.case = TRUE)) %>% 
    select(Region)
```

Regular expressions are incredibly powerful, but also can be confusing. Make
sure to check out [our resources page](/lessons/resources/) for links to sites
that explain regexp in more detail.

## Chaining pipes

We can start chaining these commands together using the `%>%` pipe command.
There is no limit to how long a chain can be.  Chaining commands together using
the pipe command makes your code easier to read, makes you type out your code
faster, and makes it easier for you to go from thinking of an analysis to
actually conducting it.  `arrange` sorts/orders/re-arranges the column Education
in ascending order.

```{r wrangleChain}
ds2 %>%
  filter(Illiteracy > 2) %>%
  select(State = StateName, ## This renames the variable, just like the rename() command
         Population, Area, Frost) %>%
  arrange(Population) %>%
  mutate(Cold = ifelse(Frost > 100, 'Yes', 'No'))
```

## Re-organize your data (using `tidyr`)

To get the data into a nicer and more analyable format, you can use the `tidyr`
package.  See what `gather` does in the code below.  Then see what `spread`
does.  Note that you can remove a column by having a minus `-` sign in front of
a variable when you use `select`.

```{r reorg}
## Compare this:
ds2 %>%
  select(-Division, -Region, -matches('itude$'))

## With this:
ds2 %>%
  select(-Division, -Region, -matches('itude$')) %>% 
  gather(Measure, Value, -StateName)

## And back again:
ds2 %>%
  select(-Division, -Region, -matches('itude$')) %>% 
  gather(Measure, Value, -StateName) %>%
  spread(Measure, Value)
```

## Summarise variables

Combined with `dplyr`'s `group_by` and `summarise` you can quickly summarise
data or do further, more complicated analyses. `group_by` makes it so further
analyses or operations work on the groups. `summarise` transforms the data to
only contain the new variable(s) created, in this case the mean, as well as the
grouping variable.

```{r reorgChain}
ds2 %>%
  select(-Division, -Region, -StateName) %>% 
  gather(Measure, Value) %>%
  group_by(Measure) %>%
  summarise(Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            SampleSize = n())
```

## Other useful and powerful examples

You can do some really powerful things with `dplyr` and `tidyr` functions.  For 
example, you can run each combination of independent and dependent variables in 
a linear regression (`lm()`) using `gather()` and the `dplyr` `do()` command, 
rather than running each individually.  To make the output from the `lm()`
easier to combine and understand, we use the `tidy()` command from the `broom`
package (`::` tells R we want to use the `tidy()` function from the `broom`
package).  If you want more details on how to use this set up,
[check out my blog post about it](http://www.lukewjohnston.com/blog/loops-forests-multiple-linear-regressions/).

```{r}
library(broom)
# Using the swiss practice dataset.
swiss %>%
    gather(Indep, Xvalue, Fertility, Agriculture) %>%
    gather(Dep, Yvalue, Education, Catholic) %>% 
    group_by(Dep, Indep) %>% 
    do(tidy(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .)))
```