---
title: "Data Tidying: Tidyverse Basics"
author: "Drs. Sarangan Ravichandran and Randall Johnson"
output: github_document
---
### Cleaning up
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(knitr)
library(lubridate)
# pull Examples for citing line number:
examples <- readLines('Examples.R')
```
# Table of Contents
1. [Tibbles](#tibbles)
1.1 [Why tibbles?](#why)
1.2 [Working with tibbles](#working)
1.3 [Examples and exercises](#eeTibbles)
2. [Importing Data](#import)
2.1 [Comments and metadata](#skip)
2.2 [Examples and exercises](#eeImport)
# <a name="tibbles"></a>Tibbles
In the tidyverse, functions commonly return tibbles rather than data.frames. Tibbles can be created with the `tibble()` function; the older alias `data_frame()` does the same thing but is deprecated in recent versions of the tibble package.
What is a tibble?
- a modern take on the traditional data.frame
- printing a tibble gives you more useful information than printing a data.frame
- `tibble` is provided by the tibble package, which is part of the core tidyverse
There is a nice vignette for working with tibbles, accessible using this command: `vignette("tibble")`.
How do you create a tibble?
```{r create-tibble}
library(dplyr) # dplyr re-exports tibble()
tibble(x = 1:5,
       y = LETTERS[1:5],
       z = x^2 + 20)
```
What are the differences between the base-R `data.frame` and a `tibble`?
```{r tibble vs data frame}
(df <- data.frame(employee = c('John Wayne','Peter Doe','Esther Julie'),
salary = c(20000, 23400, 26800),
startdate = as.Date(c('2016-12-1','2007-3-25','2016-3-14'))))
as_tibble(df)
```
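One practical difference is worth seeing directly: how `[` behaves when you pull out a single column. The sketch below uses a small made-up table; `staff` and `staff_tbl` are illustrative names, not objects used elsewhere in this document.

```{r}
library(tibble)

# hypothetical example data, not part of the workshop data sets
staff <- data.frame(employee = c("A", "B", "C"),
                    salary = c(20000, 23400, 26800))
staff_tbl <- as_tibble(staff)

class(staff)
class(staff_tbl)

# `[` with a single column drops to a plain vector for a data.frame ...
is.data.frame(staff[, "salary"])
# ... but always returns a tibble for a tibble
is_tibble(staff_tbl[, "salary"])
```

Because a tibble never silently changes shape, code that subsets it behaves the same whether one column or several are selected.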
## <a name="why"></a>Why Tibbles?
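- `tibble()` doesn't change the type of its inputs (e.g. it never converts strings to factors). `data.frame()` did convert strings to factors by default before R 4.0.0; since then its default is `stringsAsFactors = FALSE`.
```{r}
data.frame(x = letters[1:5]) %>%
  str() # x is a factor in R < 4.0.0, a character vector in R >= 4.0.0
tibble(x = letters[1:5]) %>%
  str() # never auto-converted
```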
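- `tibble()` lets a column definition refer to columns created earlier in the same call, making for neater code.
```{r, error = TRUE}
data.frame(x = 1:10,
           y = x / 2) %>%
  str() # fails unless an object named x already exists in your workspace
# the base-R workaround takes two steps
dat <- data.frame(x = 1:10)
dat$y <- dat$x / 2
str(dat)
# tibble() evaluates its arguments in order, so y can use x
tibble(x = 1:10,
       y = x / 2) %>%
  str()
```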
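- The `$` operator does partial matching of column names on a data.frame without warning you. On a tibble, `$` requires the full column name.
```{r}
data.frame(color = "red")$c # partial match: silently returns the color column
tibble(color = "red")$c     # no partial match: returns NULL with a warning
```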
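- The print method for tibbles is more user-friendly: by default it shows only the first 10 rows and the columns that fit on screen, along with each column's type.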
```{r}
data(who)
who # this is a tibble
```
```{r, eval = FALSE}
as.data.frame(who) # try printing as a data.frame (output not shown here)
```
Why not use a tibble? A few packages don't get along with tibbles (e.g. the missForest package). In that case, you may need to convert your tibble into a data.frame using `as.data.frame()` before passing it on.
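As a minimal sketch of that conversion (the `measurements` tibble is made up for illustration and stands in for your own data):

```{r}
library(tibble)

# hypothetical tibble standing in for your real data
measurements <- tibble(id = 1:3, value = c(2.5, 3.1, 4.8))

# convert for a function that insists on a plain data.frame
measurements_df <- as.data.frame(measurements)
class(measurements_df)

# and convert back once you are done
class(as_tibble(measurements_df))
```

The round trip keeps the column names and values; only the class of the object changes.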
## <a name="working"></a>Working with tibbles
Here is a more complicated tibble, consisting of a random start time within +/- 12 hours of now and a random end time within the next 30 days (where "now" is relative to when this code is run).
```{r complicatedtibble, results= 'asis'}
# lubridate gives us the now() function
library(lubridate)
twelve_hours <- 43200  # seconds
twenty4_hours <- 86400 # seconds
n <- 1000
set.seed(239847)
(t2 <- tibble(
  start = now() +   # random time within +/- 12 hours of now
    runif(n, -twelve_hours, twelve_hours),
  end = now() +     # random time within the next 30 days
    runif(n, 1, 30 * twenty4_hours),
  elapsed = as.numeric(end - start,
                       units = 'hours'), # hours between start and end
  l = sample(letters, n, replace = TRUE) # some letters
))
```
### Adding/changing variables
You can add and change variables within a tibble using `mutate()`. The syntax is nearly identical to that of `tibble()`, except that the first argument is the tibble you want to edit. For example, to add a new variable to our tibble `t2` (here a logical flag, `case`, that is `TRUE` for the first 500 rows):
```{r mutate}
(t2 <- mutate(t2,
case = 1:n <= 500))
```
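The same function also changes existing columns. Here is a small sketch on made-up data (`trips` and its columns are hypothetical) that adds two columns and then overwrites one of them:

```{r}
library(dplyr)

# hypothetical data: three trips with distance and duration
trips <- tibble(km = c(12, 30, 45),
                hours = c(0.5, 1, 1.5))

# add two columns; `fast` can use `speed` because columns are created in order
trips <- mutate(trips,
                speed = km / hours,
                fast = speed > 25)

# reusing an existing name overwrites that column
trips <- mutate(trips, speed = round(speed))
trips
```

Note that `mutate()` returns a new tibble, so the result has to be assigned if you want to keep it.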
### Adding rows
You can add rows to a tibble with `add_row()`. The `.before` option lets you specify where to insert the new data; by default (i.e. if you don't specify `.before`), the new row is added at the end of the tibble.
```{r add rows}
# see our new row on line 2?
(t2 <- t2 %>%
   add_row(start = now(),
           end = now() + days(1), # now() + 1 would add one second, not one day
           elapsed = 24,
           l = 'f',
           .before = 2))
```
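To see the default placement and what happens to columns you leave out, here is a short sketch on a made-up tibble (`grades` is a hypothetical name):

```{r}
library(tibble)

# hypothetical three-row tibble
grades <- tibble(student = c("a", "b", "c"), score = c(90, 85, 70))

# no .before: the new row goes at the end
add_row(grades, student = "d", score = 60)

# .before = 1 puts it first; columns you leave out are filled with NA
first <- add_row(grades, student = "z", .before = 1)
first
```

This is also why the row added to `t2` above has a missing value in any column that was not supplied.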
### Subsetting
You can use all the same indexing techniques described for data.frames in the [R/RStudio Intro](https://github.com/ravichas/TidyingData/blob/master/0-RStudio-Intro.md), or you can use one of the wrapper functions from the tidyverse:
- `filter()`: Select specific rows from the `tibble`
```{r filter}
# pull all rows where elapsed number of hours is less than 72
# looks like there are 102 observations (rows) that fit that criterion
filter(t2, elapsed < 72)
```
- `select()`: Select specific columns from the `tibble`
```{r select}
# pull all start and end times
select(t2, start, end)
# or drop the l column
select(t2, -l)
```
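The two functions are often chained with the pipe. The following is a sketch on made-up data (`orders` and its columns are hypothetical), showing several conditions in one `filter()` call followed by a `select()`:

```{r}
library(dplyr)

# hypothetical data
orders <- tibble(id = 1:5,
                 region = c("east", "west", "east", "north", "west"),
                 total = c(10, 250, 75, 300, 40))

big_orders <- orders %>%
  filter(total > 50, region != "north") %>% # comma-separated conditions are combined with AND
  select(id, total)                         # keep only these columns
big_orders
```

Neither function modifies `orders`; each returns a new tibble, which is what makes the chaining work.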
### Printing
You can change the defaults of tibble display with options.
```{r}
tmp <- options()
options(tibble.print_min = 6)
t2
# reset options
options(tmp)
```
You can also use the `tibble.width = Inf` option to print all columns. There are more options documented at `package?tibble`.
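If you only want a different display for a single call, you can pass `n` and `width` to `print()` instead of changing the global options. A minimal sketch, using a tibble built from the built-in `mtcars` data (`wide` is an illustrative name):

```{r}
library(tibble)

# 32-row tibble built from the built-in mtcars data
wide <- as_tibble(mtcars, rownames = "model")

print(wide, n = 3)              # show only the first 3 rows
print(wide, n = 3, width = Inf) # show every column for those rows
```

This leaves the session defaults untouched, so there is nothing to reset afterwards.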
## <a name="eeTibbles"></a>Examples and Exercises
For more examples, see line `r which(examples == "########## tibble Examples ##########")` of [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R).
Practice exercises for this section can be found in [Exercises.Rmd](https://github.com/ravichas/TidyingData/blob/master/Exercises.md#tibbleEx).
# <a name="import"></a>Importing Data
RStudio has a nice data import utility under File > Import Dataset. This will generate the code to repeat the import (i.e. so you can save it to your script).

If you are comfortable with writing the code directly, the following functions will import data into tibbles:
- `?read_csv`: import comma separated values data
- `?read_csv2`: import semicolon separated values data (European version of a csv)
- `?read_tsv`: import tab delimited data
- `?read_delim`: import a text file with data (e.g. space delimited)
- `?read_excel`: import Excel formatted data (either xls or xlsx format); this function comes from the readxl package, which `library(tidyverse)` does not load for you
If you are familiar with R, you may recognize that there are data.frame-generating counterparts in the utils package (e.g. `read.csv()` and `read.delim()`). Why would we want to use the readr functions over their base-R counterparts?
- Speed (roughly 10x faster), which can make a big difference with very large data sets
- Output from readr is a tibble
- Base-R functions inherit some behavior from the operating system and environment variables (e.g. the locale), whereas `readr` functions are OS independent and hence more consistent across platforms
```{r read_csv}
# returns a data.frame
read.csv('Data/WHO-2a.csv')
require(readr)
# returns a tibble
# also, note the helpful warning that several columns have the same name
read_csv('Data/WHO-2a.csv')
```
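If you don't have the workshop data at hand, the same comparison can be sketched with a small made-up file written to a temporary location (the file contents and the names `csv_path`, `base_result` and `readr_result` are illustrative only):

```{r}
library(readr)

# write a small, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,cases",
             "A,1999,745",
             "B,2000,2666"),
           csv_path)

base_result  <- read.csv(csv_path) # base R: returns a data.frame
readr_result <- read_csv(csv_path) # readr: returns a tibble and reports the guessed column types

class(base_result)
class(readr_result)
```

Both calls read the same values; the difference is in the class of the result and in how much readr tells you about the column types it guessed.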
## <a name="skip"></a>Comments/Metadata
Sometimes, there will be extra metadata at the top of a file, often preceded with '#'. How do we read a data set that has some metadata (indicated by '#')? What if the extra lines aren't properly marked with '#'?
```{r}
# we want to skip this first line
readLines("Data/WHO-2.csv")[1:3] # base package
# ignore metadata row
readr::read_csv("Data/WHO-2.csv", comment = "#")
# this results in identical output, but we specify how many lines to skip
readr::read_csv("Data/WHO-2.csv", skip = 1)
```
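The difference between the two arguments shows up when the extra line is not marked. The sketch below writes two made-up files to temporary locations (`marked` and `unmarked` are hypothetical names): `comment = "#"` only helps with the first, while `skip = 1` works for either.

```{r}
library(readr)

# made-up file whose metadata line starts with '#'
marked <- tempfile(fileext = ".csv")
writeLines(c("# exported for illustration only",
             "country,cases", "A,745", "B,2666"), marked)

# made-up file whose metadata line has no marker
unmarked <- tempfile(fileext = ".csv")
writeLines(c("exported for illustration only",
             "country,cases", "A,745", "B,2666"), unmarked)

from_comment <- read_csv(marked, comment = "#") # ignores everything after a '#'
from_skip    <- read_csv(unmarked, skip = 1)    # drops the first line regardless of its content

identical(from_comment$cases, from_skip$cases)
```

Using `comment` is the safer choice when the marker is reliable, because it does not depend on knowing how many metadata lines there are.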
## <a name="eeImport"></a>Exercises
Work through the exercises in the Tidyverse section of [Exercises.Rmd](https://github.com/ravichas/TidyingData/blob/master/Exercises.md#tibbleEx).