-
Notifications
You must be signed in to change notification settings - Fork 33
/
lec05-dplyr.Rmd
493 lines (395 loc) · 15.4 KB
/
lec05-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
---
title: "More dplyr and ggplot, tidying and exporting data"
author: Joel Östblom
---
## Lesson preamble
> ### Lecture objectives
>
> - Understand how to combine dplyr and ggplot.
> - Understand and apply faceting in ggplot.
> - Learn about tidy data.
> - Transform data from the long to wide format.
>
> ### Lecture outline
>
> - Split-apply-combine... plot! (30 min)
> - Faceting (10 min)
> - Reshaping with gather and spread (30 min)
> - Exporting data (15 min)
-----
## Setting up
Start by loading the required packages. Both **`ggplot2`** and **`dplyr`** are
included in the **`tidyverse`** package collection.
```{r}
# Install if needed
# install.packages('tidyverse')
library(tidyverse)
```
Load the data we saved in the previous lesson, and prepare the filtered abundant species
dataset from the end of lecture 4.
```{r, eval=FALSE}
# Download if needed
# download.file("https://ndownloader.figshare.com/files/2292169", "portal_data.csv")
surveys <- read_csv('portal_data.csv')
surveys_abun_species <- surveys %>%
filter(!is.na(hindfoot_length) & !is.na(weight)) %>%
group_by(species) %>%
mutate(n = n()) %>% # add count value to each row
filter(n > 800) %>%
select(-n)
```
```{r, echo=FALSE}
surveys <- read_csv('data/portal_data.csv')
surveys_abun_species <- surveys %>%
filter(!is.na(hindfoot_length) & !is.na(weight)) %>%
group_by(species) %>%
mutate(n = n()) %>% # add count value to each row
filter(n > 800) %>%
select(-n)
```
### Split-apply-combine... plot!
In this section, we will learn how to work with `**dplyr**` and `**ggplot**`
together. Aided by the pipes (`%>%`), we can create a powerful data exploration
workflow using these two packages.
Let's calculate number of counts per year for each species. First we need
to group the data and count records within each group:
```{r}
surveys_abun_species %>%
group_by(year, genus) %>%
tally() %>%
arrange(desc(n)) # Adding arrange just to compare with histogram
```
We could assign this table to a variable, and then pass that variable to
`ggplot()`.
```{r}
yearly_counts <- surveys_abun_species %>%
group_by(year, species) %>%
tally()
ggplot(yearly_counts, aes(x = n)) +
geom_histogram()
```
Creating an intermediate variable would be preferable for time consuming
calculations, because you would not want to do that operation every time you
change the plot aesthetics.
If it is not a time consuming calculation or you would like the flexibility of
changing the data summary and the plotting options in the same code chunk, you
can pipe the output of your split-apply-combine operation to the plotting
command:
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = n)) +
geom_histogram()
```
We can perform a quick check that the plot corresponds to the table by coloring
the histogram by species:
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = n, fill = species)) + # `fill` is specific for histograms
geom_histogram()
```
Let's explore how the number of each genus varies over time. Longitudinal data
can be visualized as a line plot with years on the x axis and counts on the y
axis:
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = year, y = n)) +
geom_line()
```
Unfortunately, this does not work because we plotted data for all the species
together. We need to tell ggplot to draw a line for each species by modifying
the aesthetic function to include `group = species`:
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = year, y = n, group = species)) +
geom_line()
```
We will be able to distinguish species in the plot if we add colors (using
`color` also automatically groups the data):
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = year, y = n, color = species)) + # `color` groups automatically
geom_line() # try adding `size = 2`
```
### Faceting
ggplot has a special technique called *faceting* that allows the user to split
one plot into multiple subplots based on a variable included in the dataset. We
will use it to make a time series plot for each species:
```{r}
surveys_abun_species %>%
group_by(year, species) %>%
tally() %>%
ggplot(aes(x = year, y = n)) + #
geom_line() +
facet_wrap(~ species)
```
Now we would like to split the line in each plot by the sex of each individual
measured. To do that we need to make counts in the data frame grouped by `year`,
`species`, and `sex`:
```{r}
surveys_abun_species %>%
group_by(year, species, sex) %>%
tally()
```
We can make the faceted plot by splitting further by sex using `color` (within a
single plot):
```{r}
surveys_abun_species %>%
group_by(year, species, sex) %>%
tally() %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species)
```
There are several observations where no sex was recorded. In this case, we are
not really interested in the observations of unknown sex, so we can filter out
those values.
```{r}
surveys_abun_species %>%
filter(!is.na(sex)) %>%
group_by(year, species, sex) %>%
tally() %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species)
```
It is possible to specify exactly which colors to use, to avoid those that are
hard to distinguish. We can also change the thickness of the lines, and adjust
the x labels so that they don't overlap.
```{r}
color_names <- c('black', 'orange')
surveys_abun_species %>%
filter(!is.na(sex)) %>%
group_by(year, species, sex) %>%
tally() %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line(size = 1) +
scale_color_manual(values = color_names) +
facet_wrap(~ species) +
theme_bw() +
theme(text = element_text(size=12),
axis.text.x = element_text(angle=30, hjust=1))
```
#### Challenge
Use the filtered data frame for all of these questions.
1. Remember the histogram colored according to each species? Starting from that
code, how could we separate each species into its own subplot?
2.
a. Which year was the average weight of the animals the highest?
b. Is the yearly trend the same for all species?
```{r, include=FALSE}
# Answers
# 1
ggplot(yearly_counts, aes(x = n, fill = species)) +
geom_histogram() +
facet_wrap(~ species)
```
```{r, include=FALSE}
# 2.a
surveys_abun_species %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
summarize(mean_weight = mean(weight)) %>%
ggplot(aes(x = year, y = mean_weight)) +
geom_line()
```
```{r, include=FALSE}
# 2.b
surveys_abun_species %>%
filter(!is.na(weight)) %>%
group_by(year, species) %>%
summarize(mean_weight = mean(weight)) %>%
ggplot(aes(x = year, y = mean_weight, color = species)) +
geom_line() +
facet_wrap(~ species)
```
## Reshaping with gather and spread
**`dplyr`** is one part of a larger **`tidyverse`** that enables you to work
with data in tidy data formats. **`tidyr`** enables a wide range of
manipulations of the structure data itself. For example, the survey data
presented here is almost in what we call a **long** format - every observation
of every individual is its own row. This is an ideal format for data with a rich
set of information per observation. It makes it difficult, however, to look at
the relationships between measurements across plots. For example, what is the
relationship between mean weights of different genera across the entire data
set?
To answer that question, we'd want each plot to have a single row, with all of
the measurements in a single plot having their own column. This is called a
**wide** data format. For the `surveys` data as we have it right now, this is
going to be one heck of a wide data frame! However, if we were to summarize data
within plots and species, we might begin to have some relationships we'd want to
examine.
Let's see this in action. First, using **`dplyr`**, let's create a data frame
with the mean body weight of each genus by plot.
```{r}
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(genus, plot_id) %>%
summarize(mean_weight = mean(weight))
head(surveys_gw)
```
### Long to Wide with `spread`
Now, to make this long data wide, we use `spread` from `tidyr` to spread out the
different taxa into columns. `spread` takes three arguments: - the data, the
*key* column (or column with identifying information), the *values* column (the
one with the numbers/values). We'll use a pipe so we can ignore the data
argument.
```{r}
surveys_gw_wide <- surveys_gw %>%
spread(genus, mean_weight)
head(surveys_gw_wide)
```
Notice that some genera have `NA` values. That's because some of those genera
don't have any record in that plot. Sometimes it is fine to leave those as
`NA`. Sometimes we want to fill them as zeros, in which case we would add the
argument `fill=0`.
```{r}
surveys_gw %>%
spread(genus, mean_weight, fill = 0) %>%
head
```
We can now do things like plot the weight of *Baiomys* against *Chaetodipus* or
examine their correlation.
```{r}
surveys_gw %>%
spread(genus, mean_weight, fill = 0) %>%
cor(use = "pairwise.complete")
```
### Wide to long with `gather`
What if we had the opposite problem, and wanted to go from a wide to long
format? For that, we use `gather` to sweep up a set of columns into one
key-value pair. We give it the arguments of a new key and value column name, and
then we specify which columns we either want or do not want gathered up. So, to
go backwards from `surveys_gw_wide`, and exclude `plot_id` from the gathering,
we would do the following:
```{r}
surveys_gw_long <- surveys_gw_wide %>%
gather(genus, mean_weight, -plot_id)
head(surveys_gw_long)
```
Note that now the `NA` genera are included in the long format. Going from wide
to long to wide can be a useful way to balance out a dataset so every replicate
has the same composition.
We could also have used a specification for what columns to include. This can be
useful if you have a large number of identifying columns, and it's easier to
specify what to gather than what to leave alone. And if the columns are
sequential, we don't even need to list them all out - just use the `:` operator!
```{r}
surveys_gw_wide %>%
gather(genus, mean_weight, Baiomys:Spermophilus) %>%
head()
```
#### Challenge
1. Make a wide data frame with `year` as columns, `plot_id` as rows, and the
values are the number of genera per plot. You will need to summarize before
reshaping, and use the function `n_distinct` to get the number of unique
types of a genus. It's a powerful function! See `?n_distinct` for more.
2. Now take that data frame, and make it long again, so each row is a unique
`plot_id` - `year` combination.
3. The `surveys` data set is not truly wide or long because there are
two columns of measurement - `hindfoot_length` and `weight`. This makes it
difficult to do things like look at the relationship between mean values of
each measurement per year in different plot types. Let's walk through a
common solution for this type of problem. First, use `gather` to create a
truly long dataset where we have a key column called `measurement` and a
`value` column that takes on the value of either `hindfoot_length` or
`weight`. Hint: You'll need to specify which columns are being gathered.
4. With this new truly long data set, calculate the average of each
`measurement` in each `year` for each different `plot_type`. Then
`spread` them into a wide data set with a column for `hindfoot_length` and
`weight`. Hint: Remember, you only need to specify the key and value
columns for `spread`.
```{r, include=FALSE}
## Answer 1
rich_time <- surveys %>%
group_by(plot_id, year) %>%
summarize(n_genera = n_distinct(genus)) %>%
spread(year, n_genera)
head(rich_time)
## Answer 2
rich_time %>%
gather(year, n_genera, -plot_id)
## Answer 3
surveys_long <- surveys %>%
gather(measurement, value, hindfoot_length, weight)
## Answer 4
surveys_long %>%
group_by(year, measurement, plot_type) %>%
summarize(mean_value = mean(value, na.rm=TRUE)) %>%
spread(measurement, mean_value)
```
## Exporting data
Now that you have learned how to use **`dplyr`** to extract information from
or summarize your raw data, you may want to export these new datasets to share
them with your collaborators or for archival.
Similar to the `read_csv()` function used for reading CSV files into R, there is
a `write_csv()` function that generates CSV files from data frames.
Before using `write_csv()`, we are going to create a new folder,
`data-processed`, in our working directory that will store this generated
dataset. We don't want to store manipulated datasets in the same directory as
our raw data. It's good practice to keep them separate. The raw data would
ideally be put in a `data-raw` folder, which should only contain the raw,
unaltered data, and should be left alone to make sure we don't delete or modify
it from how it was when we downloaded or recorded it ourself. In contrast,
our R code will create the contents of the `data-processed` directory, so even
if the files it contains are deleted, we can always re-generate them.
Use the `getwd()` function to find out which is the current working directory.
```{r}
getwd()
```
Navigate to this directory in your file browser and create a folder called
`data-processed`.
Alternatively, you could use R to create this directory.
```{r, eval=FALSE}
dir.create("data-processed")
# To suppress the warning, we could do
dir.create("data-processed", showWarnings = FALSE)
# Another alternative would be to use a conditional expression, which only
# creates the directory *if* it does not already exist. The syntax here is
# similar to the for loop we created in the second lecture.
if (!dir.exists('data-processed')) {
dir.create("data-processed")
}
```
We are going to prepare a cleaned up version of the data without NAs. Let's
start by removing observations for which the `species_id` is missing. Let's also
remove observations for which `weight` and the `hindfoot_length` are missing.
This dataset should also only contain observations of animals for which the sex
has been determined:
```{r}
surveys_complete <- surveys %>%
filter(!is.na(species_id), # remove missing species_id
!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
# This expression is a succinct alternative to the above
surveys_complete_comcas <- surveys %>%
filter(complete.cases(species_id, weight, hindfoot_length, sex))
# This is even briefer, but omits observations with NA in *any* column.
# There is no way to control which columns to use, but it is common to want
# to exclude all NAs, which in our case corresponds to the columns listed above.
surveys_complete_naomit <- na.omit(surveys)
# Compare the dimensions of the original and the cleaned data frame
dim(surveys)
dim(surveys_complete)
dim(surveys_complete_comcas)
dim(surveys_complete_naomit)
```
Now that our dataset is ready, we can save it as a CSV file in our `data-processed`
folder.
```{r, eval=FALSE}
write_csv(surveys_complete, path = "data-processed/surveys_complete.csv")
```
*Parts of this lesson material were taken and modified from [Data
Carpentry](https://datacarpentry.org) under their CC-BY copyright license. See
their [lesson page](https://datacarpentry.org/R-ecology-lesson/03-dplyr.html)
for the original source.*