-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwrangling.Rmd
254 lines (206 loc) · 9.32 KB
/
wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
title: "Data wrangling using dplyr and tidyr"
---
> Data wrangling: loosely, the process of manually converting data from one
"raw" form into another format that allows for more ease of eventual analysis
and visualization (adapted from [Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling))
R was developed by statisticians to do statistical work. As such, embedded
within R are capabilities to easily wrangle and manage data, to have data in a
format that can be used for further analysis, and to work with datasets called
dataframes. There are also excellent packages available to make data wrangling
much easier in R. These packages are `dplyr` and `tidyr`. A minor assumption
about using these packages: the data you are importing from the start is fairly
clean (i.e. no large amount of missing values, no data entry errors or fixes
needed, etc).
# Learning objectives:
- Appreciate that data wrangling and managing are important and time-consuming
aspects to research
- Learn the 'grammar' of shaping and moving data from one form to another
- Understand and apply basic statistics to quickly summarize your dataset
# Let's get wrangling, the basics
## Wrangling your data, `dplyr` style
Data wrangling can be a bit tedious in base R (R without packages), so we'll be
using two packages designed to make this easier. `dplyr` comes with a `%>%` pipe
function (via the `magrittr` package), which works similar to how the Bash shell
`|` pipe works (for those familiar with Bash, ie. those who use Mac or Linux).
The command on the right-hand side takes the output from the command on the
left-hand side, just like how a plumbing pipe works for water. `tbl_df` makes
the object into a `tbl` class, making printing of the output nicer. The other
nice thing about `dplyr` is that it can connect to SQL and other type of
databases and is very fast at wrangling data, unlike base R. Check out the
[resources page](../resources/) for links to more about this.
```{r wrangle, message=FALSE}
library(dplyr)
library(tidyr)
library(readr)
# Import the dataset
ds <- read_csv("http://codeasmanuscript.org/states_data.csv")
## Compare
head(ds)
## With:
tbl_df(ds)
## Now put the tbl dataset into a new object
ds2 <- tbl_df(ds)
ds2
```
## Select columns
Often times, you want to select only some of the columns or variables from a
dataset. For that we use the `select` command, which does as it says. Note the
use of the `%>%` operator. This allows you to chain commands together, letting
you do more with only a few commands.
```{r wrangleSelect}
ds2 %>%
select(Population, Income, Area)
```
The real power with using the `select()` function comes when you combine it with
[regular expressions (regexp)](http://www.regular-expressions.info/), or rather pattern
searching. `dplyr` has several pattern searching functions, including
`starts_with()`, `contains()`, and the most powerful `matches()`. The function
`matches()` uses regexp, which are special commands that use
certain, unique syntax for searching for patterns. For example, `^string` means
that `string` is the first character, `string$` means that `string` is last
character, `string|strung` searches for either `string` or `strung`, etc.
regexp syntax are nearly a language to themselves, so use
[StackOverflow](http://stackoverflow.com/) and Google as much as you can!
Ok, so lets say you want to search for variables that have certain patterns:
```{r wrangleSelectRegExp}
ds2 %>%
select(contains('Pop'), starts_with('Fr'))
## Or more simplified
ds2 %>%
select(matches('Pop|Fr'))
```
You can see that if you have many variables that have a common structure to
their name, you can quickly select all those variables by using functions such
as `matches()`.
## Rename columns
You can rename columns using the `rename` command (the new name is on the left
hand side, so `newname = oldname`).
```{r wrangleRename}
ds2 %>%
rename(HighSchoolGrad = HSGrad)
```
## Filter rows
Another common task in data wrangling is subsetting your dataset. You can
subset the dataset using `filter`. Note the double equal sign `==` for testing
if 'Examination' is equal to 15. A single `=` is used for something else
(assigning things to objects or using them in functions/commands).
```{r wrangleFilter}
## For continuous/number data
ds2 %>%
select(Population, Illiteracy, Income, Region) %>%
filter(Illiteracy < 2, Population == 365)
## Or for 'string' (words or letters) data
ds2 %>%
select(Population, Illiteracy, Income, Region) %>%
filter(Region == 'Northeast')
```
## Create new columns or clean up existing ones
If you want to create a new column, you use the `mutate` command. The
`ifelse()` command lets you use a condition to have different values depending
on the condition.
```{r wrangleMutate}
ds2 %>%
mutate(Testing = 'yes',
Rich = ifelse(Income > 5000, 'Yes', 'No')) %>%
select(StateName, Population, Income, Rich, Testing)
```
However, it's fairly common that you need to do some data janitorial work by
cleaning up an existing column. For example, in a dataset with a 'Sex' variable,
some values had data entry errors in spelling, such as 'fmale' when it should be
'female'. This needs to be fixed and can be done fairly easily in R. So let's
'pretend that all words starting with 'G' in the `X` (county) column should
'actually be 'J' and that all words with an 'e' at the end should be removed.
'For this, we will use the `gsub()` command within `mutate()`, which will
'*g*lobablly *sub*stitute a pattern with the replacement.
```{r wrangleMutateGsub}
ds2 %>%
select(Region) %>%
mutate(
Region = gsub(pattern = '^S', replacement = 'J', Region),
Region = gsub(pattern = 't$', replacement = '', Region)
)
```
Notice the `^` and `$` characters. Those are special syntax symbols used in
regexp commands. We introduced them above, but we'll quickly go over
it again. These special symbols perform certain functions. In this case `^G`
means for all "G" that are at the start of the string/character, while `e$`
means for all "e" that are at the end of a string. Or let"s say that all "mont",
"mout", and "mnt" should actually be "ment". We can do some cleaning fairly easily
here.
```{r wrangleMutateGsub2}
ds2 %>%
mutate(Region = gsub('Northeast', 'NE', Region, ignore.case = TRUE)) %>%
select(Region)
```
Regular expressions are incredibly powerful, but also can be confusing. Make
sure to check out [our resources page](/lessons/resources/) for links to sites
that explain regexp in more detail.
## Chaining pipes
We can start chaining these commands together using the `%>%` pipe command.
There is no limit to how long a chain can be. Chaining commands together using
the pipe command makes your code easier to read, makes you type out your code
faster, and makes it easier for you to go from thinking of an analysis to
actually conducting it. `arrange` sorts/orders/re-arranges the column Education
in ascending order.
```{r wrangleChain}
ds2 %>%
filter(Illiteracy > 2) %>%
select(State = StateName, ## This renames the variable, just like the rename() command
Population, Area, Frost) %>%
arrange(Population) %>%
mutate(Cold = ifelse(Frost > 100, 'Yes', 'No'))
```
## Re-organize your data (using `tidyr`)
To get the data into a nicer and more analyable format, you can use the `tidyr`
package. See what `gather` does in the code below. Then see what `spread`
does. Note that you can remove a column by having a minus `-` sign in front of
a variable when you use `select`.
```{r reorg}
## Compare this:
ds2 %>%
select(-Division, -Region, -matches('itude$'))
## With this:
ds2 %>%
select(-Division, -Region, -matches('itude$')) %>%
gather(Measure, Value, -StateName)
## And back again:
ds2 %>%
select(-Division, -Region, -matches('itude$')) %>%
gather(Measure, Value, -StateName) %>%
spread(Measure, Value)
```
## Summarise variables
Combined with `dplyr`'s `group_by` and `summarise` you can quickly summarise
data or do further, more complicated analyses. `group_by` makes it so further
analyses or operations work on the groups. `summarise` transforms the data to
only contain the new variable(s) created, in this case the mean, as well as the
grouping variable.
```{r reorgChain}
ds2 %>%
select(-Division, -Region, -StateName) %>%
gather(Measure, Value) %>%
group_by(Measure) %>%
summarise(Mean = mean(Value),
SD = sd(Value),
Median = median(Value),
SampleSize = n())
```
## Other useful and powerful examples
You can do some really powerful things with `dplyr` and `tidyr` functions. For
example, you can run each combination of independent and dependent variables in
a linear regression (`lm()`) using `gather()` and the `dplyr` `do()` command,
rather than running each individually. To make the output from the `lm()`
easier to combine and understand, we use the `tidy()` command from the `broom`
package (`::` tells R we want to use the `tidy()` function from the `broom`
package). If you want more details on how to use this set up,
[check out my blog post about it](http://www.lukewjohnston.com/blog/loops-forests-multiple-linear-regressions/).
```{r}
library(broom)
# Using the swiss practice dataset.
swiss %>%
gather(Indep, Xvalue, Fertility, Agriculture) %>%
gather(Dep, Yvalue, Education, Catholic) %>%
group_by(Dep, Indep) %>%
do(tidy(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .)))
```