---
title: "Tidyverse Short Tutorial"
author: "Robin Choudhury and Neil McRoberts"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    toc_collapsed: true
    toc_depth: 6
    number_sections: true
---
## Install and Load Tidyverse
```{r setup}
knitr::opts_chunk$set(echo = TRUE)
# install.packages("tidyverse") #in case you haven't installed it
library(tidyverse)
```
## Tidyverse: a Short Introduction
Tidyverse is a cluster of related packages developed by Hadley Wickham and collaborators, meant to make data analysis and visualization easier in R. I (RAC) am NOT an expert on Tidyverse, although I have used it basically daily since 2016-ish. This tutorial is meant to let you dip your toe into the Tidyverse, not make you a master. The packages inside of Tidyverse are constantly updating, with old functions being deprecated and new functions being added. Some of the lessons that I give here might be out of date in a few years (I am writing this on 2021-11-23). Some of the stuff here is highly opinionated, and Dr. McRoberts may disagree with bits and pieces of this document, but it's meant to help you as you start your journey.
![*An illustration of how the various Tidyverse packages relate to one another and are used for data analysis.*](https://oliviergimenez.github.io/intro_tidyverse/assets/img/01_tidyverse_data_science.png)
The packages in Tidyverse helped to solve many issues and gaps that came up with the standard installation of R (which I will be calling Base R). Loading in Excel/CSV files, re-arranging data, formatting dates, creating new variables, and graphing were all possible in Base R, but are (*arguably*) easier using Tidyverse.
### Iris Dataset
We will be using the iris dataset to explore how to use tidyverse packages. The iris dataset is relatively small (150 x 5) and can be useful for many different purposes, including multivariate statistics, correlation, and data visualization. It describes the petal and sepal lengths and widths for three different iris species (*Iris setosa*, *I. versicolor*, and *I. virginica*).
![*Photographs of three different iris species used in the iris dataset.*](https://miro.medium.com/max/1000/0*oUoXifiKu3tT5REt.png)
#### Load in the Iris Dataset
Load in the Iris dataset and create a dummy variable for the date, in three different formats, so that we can play around with it later.
```{r iris}
#create tibble of iris with a date column that cycles through three date formats
df <- as_tibble(iris) %>%
  mutate(date = rep(c("2021-11-23", "11/23/21", "11/23/2021"), length.out = 150))
```
#### View the Top of the Dataset
The function **head** allows you to look at the top of a dataset, including the column names and the first few lines of data.
```{r head iris}
head(df)
```
#### Saving Iris as a CSV
Let's save this dataset as a comma-separated values (CSV) file so that we can import it later.
```{r iris save csv}
#write_csv(x = df,file = "data/iris.csv" )
```
## Importing Data
Importing data used to be one of the most frustrating parts of using R. Previously (circa 2013) I would use read.csv(file.choose()) and manually pick out my dataset, but because the file path was never written into the code, anyone using it later (even me) wouldn't know where the data was. The use of projects in RStudio allows users to keep all relevant files, figures, and manuscripts in one place, making it easier to share code.
We just saved our Iris dataset as a CSV file, so now let's load it back in. We will be making this dataset a *tibble*, which is sometimes more efficient to use in analyses. Tibbles are described as:
>"Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist)." (https://tibble.tidyverse.org)
```{r load csv}
#df = read_csv("data/iris.csv") %>%
# as_tibble()
```
#### But the Dates are All Wrong...
If you look at the date column, you'll see that it is a character value, not recognized as a date value. R won't know how to parse the dates that you put in, especially if they are of mixed types (e.g. MDY, YMD, DMY, etc.). You can fix this using the **parse_date_time** function in the **lubridate** package, and specifying the possible date formats in your dataset. Formats are specified as orders of letters, such as "mdy" for "month, day, year". This transforms the date column from a character type to a POSIXct type, which R can treat as a date-time value.
```{r change date}
df = df %>%
  mutate(date = lubridate::parse_date_time(date, orders = c("mdy", "Ymd")))
```
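If you want to check what the parser is doing before running it on a whole column, a small stand-alone sketch like the one below can help. The vector `mixed_dates` is a made-up example that writes the same day in two ways, and the object names are only for illustration.

```{r parse mixed dates}
# Two spellings of the same day (made-up example values)
mixed_dates <- c("2021-11-23", "11/23/2021")
class(mixed_dates)  # plain text at this point

# "mdy" = month, day, year; "Ymd" = four-digit year, month, day
parsed_dates <- lubridate::parse_date_time(mixed_dates, orders = c("mdy", "Ymd"))

# The result is a date-time (POSIXct) vector rather than text
class(parsed_dates)
as.Date(parsed_dates)
```

Any value that matches none of the listed orders comes back as NA with a warning, so it is worth counting the NAs after parsing a real column.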
#### Ceci N'est Pas Une Pipe...
In many programming languages, you can pass the output of one function directly to another function without needing to save it as an intermediate object in between. This saves you from cluttering your workspace with intermediate objects and makes life easier down the line. The operator that does this passing is called a **pipe**, and is represented in R code as **%>%**. If you've been watching the code I've been using above, you may have noticed it. In the code above, we are passing the tibble **df** through a pipe to **mutate**, which overwrites the date column using **lubridate::parse_date_time** to parse the mixed date formats. The Tidyverse package **magrittr** allows you to use pipes in R; its name is a play on the famous René Magritte painting "The Treachery of Images".
![*The Treachery of Images by René Magritte.*](https://upload.wikimedia.org/wikipedia/en/b/b9/MagrittePipe.jpg)
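As a rough sketch of what the pipe buys you, here is the same three-step operation written first as nested function calls and then as a pipeline. The object names `nested` and `piped` are just for illustration.

```{r pipe versus nested}
library(dplyr)

# Nested calls: you have to read from the inside out
nested <- head(arrange(filter(iris, Species == "setosa"), desc(Sepal.Length)), 3)

# Piped calls: read top to bottom, one step per line
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# Both approaches give the same result
identical(nested, piped)
```

Neither version needs an intermediate object, but the piped one lists the steps in the order they happen.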
#### Specifying Which Package to Source Your Functions From
If you want to specify which package a function comes from, you can put the **"::"** operator between the package name and the function name (e.g. **dplyr::filter**). This is useful if you have a commonly named function (e.g. **filter**) that may exist in multiple packages. This is also useful if you don't want to load a lot of extra packages at the top of a script, and only want to pick and choose individual functions from different packages.
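Here is a minimal sketch of why this matters for **filter** in particular: both **dplyr** and the built-in **stats** package have a function with that name, and they do completely different things. The object names are made up for illustration.

```{r double colon}
# dplyr::filter keeps the rows of a data frame that match a condition
versicolor_rows <- dplyr::filter(iris, Species == "versicolor")
nrow(versicolor_rows)

# stats::filter applies a linear filter to a numeric series
# (here a 3-point moving average)
moving_average <- stats::filter(1:5, rep(1 / 3, 3))
moving_average
```

Writing the package name out means the code behaves the same no matter which packages happen to be loaded, or in what order.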
## Data Tidying and Wrangling
We have already explored a bit of data tidying and wrangling with the use of the **lubridate** and **magrittr** packages. Tidyverse is also capable of helping with strings, factors, and other data types.
One of the central and most useful packages in Tidyverse is dplyr, which helps to filter data, select columns to keep or drop, arrange rows, and create new columns using mutate. Let's explore some of these functions.
We do this data wrangling in R because that way we can leave the original data alone. Previously, I would (foolishly) try to do some of this in Excel, but it became hard to track down how I had modified different things. Doing it in code lets you modify data reproducibly, so that others can see exactly how you changed your data.
#### Data Filtering
Let's **filter** the iris dataset to only keep the *I. versicolor* samples...
```{r filter}
#Filter for versicolor species
filter(.data = df, Species == "versicolor")
```
#### Column Selection and Dropping
You can select or drop columns in R using **select**.
```{r select}
#Keep only the selected columns
select(.data = df, Species, Petal.Width, Petal.Length)
```
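Dropping works through the same function: putting a minus sign in front of a column name removes it, and helpers such as **starts_with** pick columns by name pattern. The sketch below uses a fresh copy of iris (`iris_tbl`, a name made up here) so that it runs on its own.

```{r select drop columns}
library(dplyr)

iris_tbl <- as_tibble(iris)

# Drop the two sepal columns and keep everything else
without_sepal <- select(iris_tbl, -Sepal.Length, -Sepal.Width)
names(without_sepal)

# Keep only the columns whose names start with "Petal"
petal_only <- select(iris_tbl, starts_with("Petal"))
names(petal_only)
```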
#### Create New Columns
You can create new columns by using the **mutate** function.
```{r mutate}
#Create a new column
mutate(.data = df, sepal_area = Sepal.Length * Sepal.Width)
```
#### Group and Summarize Values
If you want a summary of some group of values, you can use the **group_by** and **summarize** functions...
```{r}
#Group and summarize
group_by(.data = df, Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))
```
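The four verbs above are usually chained together rather than run one at a time. The sketch below is one way to do that; the column name `sepal_area` and the object name `sepal_summary` are made up for illustration.

```{r wrangling pipeline}
library(dplyr)

sepal_summary <- as_tibble(iris) %>%
  filter(Species != "setosa") %>%                       # keep two of the three species
  mutate(sepal_area = Sepal.Length * Sepal.Width) %>%   # create a new column
  group_by(Species) %>%                                 # one group per remaining species
  summarize(
    n = n(),                                            # number of rows in each group
    mean_sepal_area = mean(sepal_area),
    sd_sepal_area = sd(sepal_area)
  )

sepal_summary
```

Because each step takes a data frame and returns a data frame, you can run the pipeline one line at a time to check the intermediate results.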
#### Pivot Your Data Longer or Wider
You frequently want to change your dataset from wide to long, or from long to wide. Imagine if you collected data in an Excel file and a few columns were ratings from different plants for the same treatment (e.g. Plant 1: 10%, 15%, 10%, 20%). In this scenario, your data would be called **wide** because the values are spread out across columns. In the Iris dataset, values are **long** when considering species because there is a single column for species, but the dataset could be made longer by gathering the different types of measurements into one column. We can do that using the **pivot_longer** and **pivot_wider** functions in the **tidyr** package.
```{r pivot}
iris %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, Species), values_to = "measurement") %>%
  select(-rowid)
```
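The chunk above only goes in the longer direction, so here is a sketch of the round trip: lengthen the data, then widen it back out. The column name `trait` and the object names are made up for illustration. The row id is kept this time because **pivot_wider** needs it to know which measurements belong to the same flower.

```{r pivot round trip}
library(dplyr)
library(tidyr)
library(tibble)

# Long format: one row per flower per measurement type
iris_long <- iris %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, Species), names_to = "trait", values_to = "measurement")
dim(iris_long)

# Wide format again: one row per flower, one column per measurement type
iris_wide <- iris_long %>%
  pivot_wider(names_from = trait, values_from = measurement)
dim(iris_wide)
```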
## Graphing and Visualizing Data
You should be graphing and visualizing your data as you go. Why? Lord knows that almost everyone I know has had a data input error, where someone meant to create two rows with values 10 and 11, and ended up creating a single value of 1011. If you are on a percent scale that will create chaos! Visualizing your data will allow you to spot data input errors more easily. It will also allow you to formulate some hypotheses for how best to analyze your data. For instance, if you see that there is a relationship between two variables in a graph, you may want to conduct a regression or a correlation. The graph may also help to guide whether the regression should be linear or not.
#### Scatterplot
Scatterplots are kind of the go-to way to first visualize your data, especially if you think there is a relationship between two variables. We are going to do this with the R package **ggplot2**. The way that **ggplot2** works is that it adds new layers on top of existing layers using the **"+"** sign. Each **+** lets ggplot know that you are adding on something else in the next line. Let's compare sepal length and sepal width for the three iris species.
```{r scatter plot}
#Scatterplot
ggplot(data=df, aes(x = Sepal.Length, y = Sepal.Width))+
  geom_point(aes(color=Species, shape=Species)) +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")
```
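One way to see the layering idea is to build the plot up in steps and save each step as an object. This is only a sketch; the object names `base_plot`, `with_points`, and `with_labels` are made up for illustration.

```{r ggplot layers step by step}
library(ggplot2)

# Step 1: data and axes only, no layers yet (this draws an empty panel)
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Step 2: add a layer of points on top
with_points <- base_plot + geom_point(aes(color = Species, shape = Species))

# Step 3: add axis labels and a title
with_labels <- with_points +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")

with_labels
```

Saving the intermediate steps is handy when you want several versions of a figure that share the same base.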
#### Boxplots
Visualizing with scatterplots helps you to represent the raw data very well, but it doesn't do a good job of showing summary statistics for the data. In recent years, scientists have been moving away from bar plots (see https://thenode.biologists.com/barbarplots/photo/ for a detailed explanation). Boxplots are a common alternative: they show the median, the quartiles, and any outlying points for each group.
```{r boxplot}
#Boxplot
ggplot(data=df, aes(x=Species, y=Sepal.Length)) +
  geom_boxplot(aes(fill=Species)) +
  ylab("Sepal Length") +
  ggtitle("Iris Boxplot")
```
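If you want the numbers behind the boxes, you can compute them yourself. In **geom_boxplot**, the middle line is the median and the edges of the box are the 25th and 75th percentiles, so a grouped summary like the sketch below should line up with the plot. The object and column names are made up for illustration.

```{r boxplot numbers}
library(dplyr)

box_numbers <- iris %>%
  group_by(Species) %>%
  summarize(
    lower_quartile = quantile(Sepal.Length, 0.25),  # bottom edge of the box
    middle = median(Sepal.Length),                  # line inside the box
    upper_quartile = quantile(Sepal.Length, 0.75)   # top edge of the box
  )

box_numbers
```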
#### Raw and Summarized Data?
Okay, so boxplots can show you summarized data well, but what if you want both? I mentioned earlier that **ggplot** layers on data, so each subsequent layer goes on top of the others. So we can put down the raw dataset (here in the form of a jitterplot, a scatterplot with a little bit of random noise added in) and then draw the boxplot on top, made a little bit transparent (using the alpha argument):
```{r boxplot jitterplot}
#Boxplot + Jitter plot
ggplot(data=df, aes(x=Species, y=Sepal.Length)) +
  geom_jitter(shape = 21, height = 0,aes(fill = Species)) +
  geom_boxplot(aes(fill=Species), alpha = 0.5) +
  ylab("Sepal Length") +
  ggtitle("Iris Boxplot")
```
#### Histograms
Oftentimes, we want to know what the distribution of a dataset is. This will give us an idea of whether we are working with normally distributed data or whether we need to use non-parametric statistics to analyze it.
```{r histogram}
#Histograms
ggplot(data=df, aes(x=Sepal.Width)) +
  geom_histogram(position = "dodge",binwidth=0.2, color="black", aes(fill=Species)) +
  xlab("Sepal Width") +
  ylab("Frequency") +
  ggtitle("Histogram of Sepal Width")
```
#### Facets of Graphs
We sometimes want to visualize datasets side by side, without having them all intermingled. Imagine if you had collected spore data from multiple sites and wanted to see what was going on at individual sites or years. We can visualize data in this way using facets. Facets can be applied as a wrap (where a sequence of panels is wrapped onto multiple rows, in the order of the factor levels in the dataset) or as a grid (for example, by site and year). Let's facet our dataset as a grid based on the species, and keep the different scales free.
```{r facets}
#Faceting
ggplot(data=df, aes(Sepal.Length, y=Sepal.Width,
                    color=Species)) +
  geom_point(aes(shape=Species), size=1.5) +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Faceting") +
  facet_grid(Species~., scales = "free")
```
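For comparison, here is a sketch of the wrap version using **facet_wrap**. The choice of two columns (`ncol = 2`) and the object name `wrapped_plot` are arbitrary and only there for illustration.

```{r facet wrap}
library(ggplot2)

wrapped_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 1.5) +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Faceting with a wrap") +
  # One panel per species, filling two columns and then wrapping to a new row
  facet_wrap(~ Species, ncol = 2, scales = "free")

wrapped_plot
```

A wrap tends to suit a single variable with many levels, while a grid suits two variables crossed against each other.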
#### Changing the Background and Other Themes
Sometimes you want the background of a graph to be a different color. You can pretty easily change stuff like this in ggplot using themes. Let's look at that by adding **theme_bw**:
```{r facets background}
#Faceting with a black-and-white theme
ggplot(data=df, aes(Sepal.Length, y=Sepal.Width,
                    color=Species)) +
  geom_point(aes(shape=Species), size=1.5) +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Faceting") +
  theme_bw() +
  facet_grid(Species~., scales = "free")
```
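Built-in themes like **theme_bw** change many settings at once. If you only want to tweak one or two things, you can add a **theme** call after it. The sketch below is one possibility; the fill color and legend position are arbitrary choices made for illustration.

```{r custom theme}
library(ggplot2)

custom_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 1.5) +
  theme_bw() +
  # Settings in theme() override the matching settings from theme_bw()
  theme(
    panel.background = element_rect(fill = "lightyellow"),
    legend.position = "bottom"
  )

custom_plot
```

The order matters here: a complete theme such as **theme_bw** added after **theme** would reset those tweaks.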
#### Is There a Trend in the Data?
Sometimes you want to see if there is a trend in the data. You can add on a smoothing function with **geom_smooth** to look at that, which defaults to a loess fit for datasets with fewer than 1,000 observations.
```{r facets background smooth}
#Faceting with a loess smoother
ggplot(data=df, aes(Sepal.Length, y=Sepal.Width,
                    color=Species)) +
  geom_point(aes(shape=Species), size=1.5) +
  geom_smooth() +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Faceting") +
  theme_bw() +
  facet_grid(Species~., scales = "free")
```
#### Can We See a Linear Trend?
The loess looks fairly linear, so let's see what that looks like if we fit a linear model instead.
```{r facets background smooth linear}
#Faceting with a linear smoother
ggplot(data=df, aes(Sepal.Length, y=Sepal.Width,
                    color=Species)) +
  geom_point(aes(shape=Species), size=1.5) +
  geom_smooth(method = "lm") +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  ggtitle("Faceting") +
  theme_bw() +
  facet_grid(Species~., scales = "free")
```
#### Can We Do Several Regressions at Once?
If we want to see the results of several different regressions at once, we can fit a regression within each group and use the **tidy** function from the **broom** package to return a cleaned-up table of the coefficients.
```{r grouped regression}
library(broom)
df %>%
  group_by(Species) %>%
  do(tidy(
    lm(Sepal.Length ~ Sepal.Width, data = .)))
```
We can see that the slopes of the lines are slightly different, but that there is a significant relationship between sepal length and width in all three species (indicated by the relatively low p-values for the Sepal.Width term).
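The **do** function is marked as superseded in recent versions of dplyr. One possible alternative, sketched below, is **group_modify**, which runs a function on each group and stacks the results. The object name `slopes` is made up, and keeping only the slope rows is just one way to make the comparison easier to read.

```{r grouped regression group modify}
library(dplyr)
library(broom)

slopes <- iris %>%
  group_by(Species) %>%
  # .x is the data for one species at a time
  group_modify(~ tidy(lm(Sepal.Length ~ Sepal.Width, data = .x))) %>%
  ungroup() %>%
  # Keep the slope rows and drop the intercepts
  filter(term == "Sepal.Width")

slopes
```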
## Final Notes
Tidyverse is constantly changing, and it's not the end-all-be-all of programming in R. You can absolutely succeed without ever touching any Tidyverse packages, but I think that they're helpful (for now). If you ever run into trouble figuring out how to use Tidyverse things, they have created a series of cheat sheets (https://www.rstudio.com/resources/cheatsheets/) which are VERY useful, especially when you are first starting out.
Good luck and have fun!
## Session Information
In case you all need to know what versions of things I was using:
```{r session info}
sessionInfo()
```