-
Notifications
You must be signed in to change notification settings - Fork 13
/
M1_2_data_munging.Rmd
1323 lines (935 loc) · 55.6 KB
/
M1_2_data_munging.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: 'M1-2: Data Munging'
author: "Daniel S. Hain ([email protected])"
date: "06/09/2018"
output:
html_document:
df_print: paged
toc: yes
toc_float: true
number_sections: yes
---
```{r setup, include=FALSE}
### Generic preamble
Sys.setenv(LANG = "en")
### Clean Workspace (I like to start clean)
rm(list=ls()); graphics.off() # get rid of everything in the workspace
detachAllPackages <- function() { # Also, detach packages to avoid functions masked by others
basic.packages <- c("package:stats","package:graphics","package:grDevices","package:utils","package:datasets","package:methods","package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0) for (package in package.list) detach(package, character.only=TRUE)
}
detachAllPackages(); rm(detachAllPackages)
### checking if all we need is installed
list.of.packages <- c("knitr", "tidyverse", "magrittr", "data.table", "skimr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
rm(list.of.packages, new.packages)
### Load packages Standard
library(knitr) # For display of the markdown
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 ect.
library(magrittr) # For extra-piping operators (eg. %<>%)
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives
### Knitr options
opts_chunk$set(warning=FALSE,
message=FALSE,
fig.align="center"
)
```
# Introduction
Welcome to your first SDS session. In this session, you will learn the basic grammar of data manipulation, some best-practice advices
In our OWSEMN pipeline, we will here focus mainly in
* **O:** Obtaining our data
* **S:** Scrubbing / Cleaning our data
Particular focus will be on **S**, while we keep the **O** parts simple. More tricky data-gathering settings will be encountered in later sessions. Since data manipulation always follows a purpose and requires some understanding of the data at hand, we will also have a first glance in **E** (exploration and visualization). However, we will her only cover the very basics and skip most of the details. Again, you will have a dedicated sessions lateron.
In this session, you will learn:
* How to do basic variable filtering, selection, and manipulation
* How to create various types of data summarization
* How to also apply these actions on grouped data
* How to join data from different sources
* How to reshape your data
* How to deal with missing values
From my experience, this covers ca. 95% of common data manipulation tasks. Sound like fun? Lets get started!
# Before we start: Some digression on data manipulation workflows
## Reminder: Base-R
### Basics
#### Assignments
You can assign a value to an object using `assign()`, `<-`, or `=`.
```{r}
x <- 3 # Assignment
x # Evaluate the expression and print result
y <- 4 # Assignment
y + 5 # Evaluation, y remains 4
z <- x + 17*y # Assignment
z # Evaluation
rm(z) # Remove z: deletes the object.
# z # Error!
```
#### Value comparisons
Comparisons return boolean values: TRUE or FALSE (often abbreviated to T and F)
```{r}
2==2 # Equality
2!=2 # Inequality
x <= y # less than or equal: "<", ">", and ">=" also work
```
#### Special constants
NA, NULL, Inf, -Inf, NaN
```{r}
# NA - missing or undefined data
5 + NA # When used in an expression, the result is generally NA
is.na(5+NA) # Check if missing
# NULL - an empty object, e.g. a null/empty list
10 + NULL # use returns an empty object (length zero)
is.null(NULL) # check if NULL
# Inf and -Inf represent positive and negative infinity
# They can be returned by mathematical operations like division of a number by zero:
5/0
is.finite(5/0) # Check if a number is finite
# NaN (Not a Number) - the result of an operation that cannot be reasonably defined
0/0
is.nan(0/0)
```
### Object classes
#### Vectors
```{r}
v1 <- c(1, 5, 11, 33) # Numeric vector, length 4
v2 <- c("hello","world") # Character vector, length 2 (a vector of strings)
v3 <- c(TRUE, TRUE, FALSE) # Logical vector, same as c(T, T, F)
```
Combining different types of elements in one vector will coerce the elements to the least restrictive type:
```{r}
v4 <- c(v1,v2,v3,"boo") # All elements turn into strings
```
Other ways to create vectors:
```{r}
v <- 1:7 # same as c(1,2,3,4,5,6,7)
v <- rep(0, 77) # repeat zero 77 times: v is a vector of 77 zeroes
v <- rep(1:3, times=2) # Repeat 1,2,3 twice
v <- rep(1:10, each=2) # Repeat each element twice
v <- seq(10,20,2) # sequence: numbers between 10 and 20, in jumps of 2
length(v) # check the length of the vector
v1 <- 1:5 # 1,2,3,4,5
v2 <- rep(1,5) # 1,1,1,1,1
```
Element-wise operations:
```{r}
v1 + v2 # Element-wise addition
v1 + 1 # Add 1 to each element
v1 * 2 # Multiply each element by 2
v1 + c(1,7) # This doesn't work: (1,7) is a vector of different length
```
Mathematical operations:
```{r}
sum(v1) # The sum of all elements
mean(v1) # The average of all elements
sd(v1) # The standard deviation
cor(v1,v1*5) # Correlation between v1 and v1*5
```
Logical operations:
```{r}
v1 > 2 # Each element is compared to 2, returns logical vector
v1==v2 # Are corresponding elements equivalent, returns logical vector.
v1!=v2 # Are corresponding elements *not* equivalent? Same as !(v1==v2)
(v1>2) | (v2>0) # | is the boolean OR, returns a vector.
(v1>2) & (v2>0) # & is the boolean AND, returns a vector.
(v1>2) || (v2>0) # || is the boolean OR, returns a single value
(v1>2) && (v2>0) # && is the boolean AND, ditto
```
Vector elements:
```{r}
v1[3] # third element of v1
v1[2:4] # elements 2, 3, 4 of v1
v1[c(1,3)] # elements 1 and 3 - note that your indexes are a vector
v1[c(T,T,F,F,F)] # elements 1 and 2 - only the ones that are TRUE
v1[v1>3] # v1>3 is a logical vector TRUE for elements >3
```
**NOTE:** If you are used to languages indexing from 0, R will surprise you by indexing from 1.
To add more elements to a vector, simply assign them values.
```{r}
v1[6:10] <- 6:10
```
We can also directly assign the vector a length:
```{r}
length(v1) <- 15 # the last 5 elements are added as missing data: NA
```
#### Factors --------
Factors are used to store categorical data.
```{r}
eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue") #vector
eye.col.f <- factor(c("brown", "green", "brown", "blue", "blue", "blue")) #factor
eye.col.v
eye.col.f
```
R will identify the different levels of the factor - e.g. all distinct values. The data is stored internally as integers - each number corresponding to a factor level.
```{r}
levels(eye.col.f) # The levels (distinct values) of the factor (categorical variable)
as.numeric(eye.col.f) # The factor as numeric values: 1 is blue, 2 is brown, 3 is green
as.numeric(eye.col.v) # The character vector, however, can not be coerced to numeric
as.character(eye.col.f)
as.character(eye.col.v)
```
#### Matrces & Arrays
A matrix is a vector with dimensions:
```{r}
m <- rep(1, 20) # A vector of 20 elements, all 1
dim(m) <- c(5,4) # Dimensions set to 5 & 4, so m is now a 5x4 matrix
```
Create a matrix using `matrix()`:
```{r}
m <- matrix(data=1, nrow=5, ncol=4) # same matrix as above, 5x4, full of 1s
m <- matrix(1,5,4) # same matrix as above
dim(m) # What are the dimensions of m?
```
Create a matrix by combining vectors:
```{r}
m <- cbind(1:5, 5:1, 5:9) # Bind 3 vectors as columns, 5x3 matrix
m <- rbind(1:5, 5:1, 5:9) # Bind 3 vectors as rows, 3x5 matrix
m <- matrix(1:10,10,10)
```
Select matrix elements:
```{r}
m[2,3] # Matrix m, row 2, column 3 - a single cell
m[2,] # The whole second row of m as a vector
m[,2] # The whole second column of m as a vector
m[1:2,4:6] # submatrix: rows 1 and 2, columns 4, 5 and 6
m[-1,] # all rows *except* the first one
m[1,]==m[,1] # Are elements in row 1 equivalent to corresponding elements from column 1?
m>3 # A logical matrix: TRUE for m elements >3, FALSE otherwise
m[m>3] # Selects only TRUE elements - that is ones greater than 3
t(m) # Transpose m
m <- t(m) # Assign m the transposed m
m %*% t(m) # %*% does matrix multiplication
m * m # * does element-wise multiplication
```
#### Arrays: more than 2 dimensions
Created with the `array()` function:
```{r}
a <- array(data=1:18,dim=c(3,3,2)) # 3d with dimensions 3x3x2
a <- array(1:18,c(3,3,2)) # the same array
```
#### Lists
Lists are collections of objects (e.g. of strings, vectors, matrices, other lists, etc.)
```{r}
l1 <- list(boo=v1,foo=v2,moo=v3,zoo="Animals!") # A list with four components
l2 <- list(v1,v2,v3,"Animals!")
l3 <- list()
l4 <- NULL
l1["boo"] # Access boo: this returns a list.
l1[["boo"]] # Access boo: this returns the numeric vector
l1[[1]] # Returns the first component of the list, equivalent to above.
l1$boo # Named elements can be accessed using the $ operator - equivalent to [[]]
```
Add more elements to a list:
```{r}
l3[[1]] <- 11 # add an element to the empty list l3
l4[[3]] <- c(22, 23) # add a vector as element 3 in the empty list l4.
# Since we added element 3, elements 1 & 2 will be generated and empty (NULL)
l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11 # We added an 8th element, but not 6th or 7th. Those will be created empty (NULL)
l1$Something <- "A thing" # Adds a ninth element - "A thing", named "Something"
```
#### Data Frames
The data frame is a special kind of list used for storing dataset tables. Think of rows as cases, columns as variables. Each column is a vector or factor.
Creating a dataframe:
```{r}
dfr1 <- data.frame( ID=1:4,
FirstName=c("John","Jim","Jane","Jill"),
Female=c(F,F,T,T),
Age=c(22,33,44,55) )
dfr1$FirstName # Access the second column of dfr1.
```
Notice that R thinks this is a categorical variable and so it's treating it like a factor, not a character vector.
Let's get rid of the factor by telling R to treat FirstName as a vector:
```{r}
dfr1$FirstName <- as.vector(dfr1$FirstName)
```
Alternatively, you can tell R you don't like factors from the start using `stringsAsFactors=FALSE`
```{r}
dfr2 <- data.frame(FirstName=c("John","Jim","Jane","Jill"), stringsAsFactors=FALSE)
dfr2$FirstName # Success: not a factor.
```
Access elements of the data frame
```{r}
dfr1[1,] # First row, all columns
dfr1[,1] # First column, all rows
dfr1$Age # Age column, all rows
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the gender and age of John & Jim
dfr1[c(1,3),] # Rows 1 and 3, all columns
```
Find the names of everyone over the age of 30 in the data
```{r}
dfr1[dfr1$Age>30,2]
```
Find the average age of all females in the data:
```{r}
mean ( dfr1[dfr1$Female==TRUE,4] )
```
### Flow Control (loops & friends)
`if (condition) expr1 else expr2`
```{r}
x <- 5; y <- 10
if (x==0) y <- 0 else y <- y/x
y
```
`for (variable in sequence) expr`
```{r}
ASum <- 0; AProd <- 1
for (i in 1:x)
{
ASum <- ASum + i
AProd <- AProd * i
}
ASum # equivalent to sum(1:x)
AProd # equivalemt to prod(1:x)
```
`while (condintion) expr`
```{r}
while (x > 0) {print(x); x <- x-1;}
```
`repeat expr, use break to exit the loop`
```{r}
repeat { print(x); x <- x+1; if (x>10) break}
```
### R troubleshooting
While I generate many (and often very creative) errors in R, there are three simple things that will most often go wrong for me. Those include:
* Capitalization. R is case sensitive - a graph vertex named "Jack" is not the same as one named "jack". The function `rowSums` won't work as "rowsums" or "RowSums".
* Object class. While many functions are willing to take anything you throw at them, some will still surprisingly require character vector or a factor instead of a numeric vector, or a matrix instead of a data frame. Functions will also occasionally return results in an unexpected format.
* Package namespaces. Occasionally problems will arise when different packages contain functions with the same name. R may warn you about this by saying something like "The following object(s) are masked from 'package:igraph'" as you load a package. One way to deal with this is to call functions from a package explicitly using '::'. For instance, if function 'blah' is present in packages A and B, you can call A::blah and B::blah. In other cases the problem is more complicated, and you may have to load packages in certain order, or not use them together at all. For example (and pertinent to this workshop), igraph and statnet packages can cause some problems when loaded at the same time. It is best to detach one before loading the other.
```{r}
# library(dplyr) # load a package
# detach(package:dplyr) # detach a package
```
For more advanced troubleshooting, check out `try()`, `tryCatch()`, and `debug()`.
```{r}
?tryCatch
```
Generally, just using `?functionyouwonderabout` often solves problems. There you can review the functions arguments, inputs, outputs, syntax etc.
```{r}
?merge
```
# Introducing [`dplyr`](https://dplyr.tidyverse.org/) workflows
Base R comes with quite some functionality for slicing and dicing data, there also exists a myriad specialized packages for more tricky data manipulation. To read others' code and example as well as to perform some special operations, you all should be able to use standard R syntax. Lets create a super-simple dataframe and slice it a bit. Lets imagine we have 10 persons with a corresponging numeric id, and some value, such as an IQ score. Soem are members of group A, and some of group B or C. We generate that all somnewhat random:
```{r}
x <- data.frame(id = 1:10,
group = sample(LETTERS[1:3], 10, replace = TRUE),
score = rnorm(10, mean = 0, sd = 1) )
x
```
Now, lets imaging we would like to sort, subset or aggregate it, we would do stuff like:
```{r}
# ordering by score
x[order(x$score),]
```
or
```{r}
# selecting only positive scores
x[x$score >= 0,]
```
or
```{r}
# replacing negative scores with zero
x[x$score < 0, "score"] <- 0
```
while that is all possible, the `[row, column]` syntax is not very comfortable. Further, for more tricky operation such as certain aggregations etc., one has to rely on a variety of packages, which often come with an own syntax.
The good news is: The efforts of a small set of key-developers (foremost [Hadley Wickham](http://hadley.nz/)) has let to the development of the [`tidyverse`](https://www.tidyverse.org/), an ecosystem of R packages particularly designed for data science applications. All packages share an underlying design philosophy, common API, grammar, and data structures.
Among the most amazing contributions here is [`dplyr`](https://dplyr.tidyverse.org/), a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. I use `dplyr` for 90% of my data-manipulation tasks for the following reasons:
* All the underlying code is runs optimized in `C++`, making it faster than most base R
* It consistently unifies the grammar of data manipulation to a small set of operations, which can be flexibly combined to master almost every task
* It is designed to work neathly with the `%>%` pipe-operator of [magrittr](hhttps://magrittr.tidyverse.org/) (more on that later)
* its syntax is very similar to the logic of `SQL` and other data-management languages
* It expanded far beoyond its original 5 verbs, and now replaces most base R commands with optimized, clever, and high-performance altgernatives
* It works neathly with many databases, such as `SQL`
I will not touch on all packages there, but the complete `tidyverse` covers almost all issues of data manipulation. They all operate under the same logic, are fast, and usually your best choice for almost any given problem. Particularly `dplyr` is enourmeously powerfull, and has a lot more functions than the basics I cover here. So, for every given probloblem, your first question (to yourself or [stackoverflow](https://stackoverflow.com/)) should be:
1: Is there a way to solve my problem in `dplyr`?
2: If not, is there another `tidyverse` package dedicated to this problem?
## General syntax
So, let's zoom in the `dplyr` functionality and syntax for a moment, before we illustrate it at the following data-manipulation tasks.
dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the `dbplyr` package then read `vignette("databases", package = "dbplyr")`.
First, let's install (if necessary) and load it. For the sake of illustration, I will load every package of the `tidyverse` one-by-one when we need it. However, normally I just load `library(tidyverse)` all at once, since I need a lot of these packages often anyhow
```{r,warning=FALSE}
# # The easiest way to get dplyr is to install the whole tidyverse, including GGplot2 and all the stuff we will need later anyhow:
# install.packages("tidyverse")
# library(tidyverse) # To load the whole compilation of packages
library(dplyr)
library(magrittr) # For some advanced piping operations not included in dplyr
```
### Piping in `dplyr`
![Cleaning](media/m1_pipe.jpg)
in traditional R syntax, data-manipulations are carried out one by one. For example, one would first assign a new variable `x$numbers <- 1:5`, then maybe manipulate it `x$numbers <- x$numbers * 2`, and subset it `x <- x[x$numbers > 4]`. `dplyr` makes use of `margrittr`'s pipes, written like `%>%`. A pipe means take the output of it's left-hand side and insert it as first input in the function on the right-hand side. Accordingly, all `dplyr` functions follow the syntax that their first input is always the data to be manipulated. Therefore, they can all be "piped".
The example before we could pipe as follows:
```{r, eval = FALSE}
x <- data_frame(numbers = 1:5) %>%
mutate(numbers = numbers * 2) %>%
filter(numbers > 4)
```
It basically reads lite: Create a dataframe (to be precise, a `tibble`) with the variable "numbers" and assign the values 1:5 THEN multiply them with 2 THEN subset the dataframe to only rows with a nuimber value higher than 4.
It first looks not so intuitive, but it will become your second nature. Using pipes facilitates fast, reproducible and easily readable coding practices, and all of you are encouraged to go on with that.
One note: `%>%` pipes do not autometically assign their output to the left-hand side object, meaning the original dataset will not per se be overwritten. To do that, there are two ways:
1: Initially, assign the output to the original data with `<-`
2: Initially, use `margrittr`'s `%<>%` command, meaning: Assign and pipe.
```{r, eval = FALSE, warning = FALSE}
x <- x %>%
filter(numbers > 5)
x %<>%
filter(numbers > 8)
```
In conclusion:
The pipe basically passes on dataframe between functions in the following way:
```{r, eval = FALSE, warning = FALSE}
x %>% fun(y)
# Is equivalent to
fun(x, y)
# While
x %<>% fun(y)
# Is equivalent to
x <- fun(x, y)
```
### The 5 core verbs of data-manipulation
At ity's core are 5 central verbs of data/manipulation, which are
1. `filter()` picks cases based on their values.
2. `select()` picks variables based on their names.
3. `arrange()` changes the ordering of the rows.
4. `mutate()` adds new variables that are functions of existing variables
5. `summarise()` reduces multiple values down to a single summary.
These all combine naturally with `group_by()` which allows you to perform any operation "by group". You can learn more about them in `vignette("dplyr")`. As well as these single-table verbs, dplyr also provides a variety of two-table verbs (joins and mutations), which you can learn about in `vignette("two-table")`.
We will explore now a bit with the starwars dataset, which comes together with `dplyr`
```{r}
head(starwars) # The "startwars" dataset is an integrated dplyr goddie :)
```
For a first inspection of new data, I like the corresponding function in the little `skimr` package, which gives a nice brief summary on data, types, and values.
```{r}
# install.packages("skimr")
library(skimr)
skim(starwars)
```
#### 1: `filter()`
The verb `filter()` lets you subset a dataframe by rows (observations), meaning the output will filter for only rows which fulfill a certain condition.
![Filter](media/m1_dplyr_filter.png)
```{r}
starwars %>%
filter(species == "Droid")
```
Conditions can be logically combined with boolean operators (`&, |`)
```{r}
starwars %>%
filter(homeworld == "Tatooine" & species == "Droid")
```
#### 2: `select()`
The verb `select()` lets you subset a dataframe by column (variable), meaning the output will only contain certain columns in the stated order
```{r}
starwars %>%
select(name, homeworld)
```
It can also be used to deselect columns with a leading `-`
```{r}
starwars %>%
select(-birth_year, -skin_color)
```
Lastly, many of `dplyr`'s other conditions can be applied to select or deselect columns. More on that later.
```{r}
starwars %>%
select(name, ends_with("color"))
```
#### 3: `arrange()`
The verb `arrange()` defines the way the rows of your dataframe are ordered
![Arrange](media/m1_dplyr_arrange.png)
```{r}
starwars %>%
arrange(homeworld, desc(mass))
```
#### 4: `mutate()`
The verb `mutate()` lets you manipulate existing variables or create new ones.
![Mutate](media/m1_dplyr_mutate.png)
```{r}
starwars %>%
mutate(bmi = mass / ((height / 100) ^ 2),
mass.rel = mass / max(mass, na.rm = TRUE)) %>%
select(name:mass, mass.rel, bmi) %>%
arrange(desc(bmi))
```
#### 5: `summarize()`
The verb `summarize()` reduces your dataset to one observation, which is summarized according to a defined function.
![Summarize](media/m1_dplyr_summarize.png)
```{r}
starwars %>%
summarize(min.height = min(height, na.rm = TRUE),
mean.height = mean(height, na.rm = TRUE),
max.height = max(height, na.rm = TRUE),
sd.height = sd(height, na.rm = TRUE) )
```
#### Doing manipulation by group: `group_by`
Finally, `by_group()` is offers the perfect complement to the 5 basic verbs, since it allows to perform aqll the above mentioned tasks sepperate according to used-defined groupings in one or multiple categories. Lets look at some examples:
![Group](media/m1_dplyr_group.png)
First of all, it works amazingly well with the `summarize()` verb, producing within group summaries
```{r}
starwars %>%
group_by(homeworld) %>%
summarize(mean.height = mean(height, na.rm = TRUE)) %>%
ungroup() %>%
arrange(desc(mean.height))
```
Note that it is adviced to finish always such operations with `ungroup()`. Otherwise, the still active grouping might at a later point lead to all kind of unforseen results.
### Your turn
Some little exercises in between: Lets go to this [Kaggle notebook exercise](https://www.kaggle.com/danielhain/sds-2018-m1-2-data-munging-exercise-1) and show what you learned
##### Two-table verbs
It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In `dplyr`, there are three families of verbs that work with two tables at a time:
1. Mutating joins, which add new variables to one table from matching rows in another.
2. Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
3. Set operations, which combine the observations in the data sets as if they were set elements.
Among those, mutating joins are the by far most used ones, so we will briefly discuss them here, and cover other types later in the applied part. For questions right away, consult `vignette(two-"two-table")`. Mutating joins allow you to combine variables from multiple tables (merging data). We here assume our data is "tidy" (meaning: rows = observations, columns = variables).
Lets briefly create some example data:
```{r}
df1 <- data_frame(x = c(1, 2), y = 2:1)
df2 <- data_frame(x = c(1, 3), a = 10, b = "a")
df1
df2
```
There are 4 types of mutating join to combine variables from dataframe `x` and `y, which differ in their behaviour when a match is not found. We'll illustrate each with a simple example:
1. `left_join(x, y)` includes all observations in `x`, regardless of whether they match or not. This is the most commonly used join because it ensures that you don't lose observations from your primary table.
```{r}
df1 %>% left_join(df2)
```
2. `right_join(x, y)` includes all observations in `y`. It's equivalent to `left_join(y, x)`, but the columns will be ordered differently.
```{r}
df1 %>% right_join(df2)
```
```{r}
df2 %>% left_join(df1)
```
3. `inner_join(x, y)` only includes observations that match in both `x` and `y`.
```{r}
df1 %>% inner_join(df2)
```
4. `full_join()` includes all observations from `x` and `y`.
```{r}
df1 %>% full_join(df2)
```
The left, right and full joins are collectively know as **outer joins**. When a row doesn't match in an outer join, the new variables are filled in with missing values.
While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
```{r}
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join(df2)
```
Note: In case of multiple matches in `y`, adittional rows will be added. Therefore, if that is not the behavior you want, use the `distinct()` function afterwards.
By default, joins are performed on the bases of **all** variables appearing in `x` and `y`. You can also specify one or multiple variables that should be matched by using the `by` argument. This can also be used to match on variables with different names in `x` and `y`.
```{r}
df1 <- data_frame(x = c(3, 2, 1), y = 1:3, z = c(1,3,4))
df2 <- data_frame(x = c(1, 2, 3), y = 2:4, q = c("a", "b", "a"))
df1 %>% left_join(df2, by = "x")
df1 %>% left_join(df2, by = c("x", "y"))
df1 %>% left_join(df2, by = c("x" = "y"))
```
#### Final remarks
Even though that doesnt sound too much, when combining them right, these basic verbs will enable you to do ca 80% of common data manipulation tasks.
![Cleaning](media/m1_tidy_workflow.png)
```{r, influde = FALSE}
rm(df1, df2, x)
```
#### Your turn
Some little exercises in between: Lets go to this [Kaggle notebook exercise](https://www.kaggle.com/danielhain/sds-2018-m1-2-data-munging-exercise-2) and show what you learned!
# Case Study: Cleaning up historical data on voting of the United Nations General Assembly
So, after we refreshed the basic grammar of data-manipulation, we will put it to a test in the following example. We will step-by-step explore the power of the basic grammar of data-manipulation.
We will use the "UN GEneral Assembly Voting" dataset, and explore the pattern and some underlying mechanisms of country voting choices. Lets get started!
![UNGA](media/m1_un.jpeg)
Source: Erik Voeten "Data and Analyses of Voting in the UN General Assembly" Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013). Available at SSRN: http://ssrn.com/abstract=2111149
## Step 1: Optaining the deta
Usually, we would now have to download some data online, access it via an API or a local database, or something like that. The full dataset is available at [Harvard's DataVerse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379). For a first warm up, we will skip this part, and use the following data I conveniently prepared for you.
A little hint here in case you want to load a `.csv` and stuff the like: The `data.table` package was a really nice function called `fread()` ( stands for "friendly read"), which by now replases mostly `R`'s standard `read.table`, `read.csv()` ect. since it is just faster, automatically recognizes most data format, and codes the variables smarter. However, back to our data. For my own workflow in R, I prefer to save datasets as `.rds` file, which creates a serialized version of the dataset and then saves it with gzip compression. Thatreduces the disk-space of datasets, while they are still read faster than for example a `.csv`.
```{r}
votes <- readRDS("data/UN_votes.rds")
```
## Step 2: First Inspection
Lets first have a look what's in the dataset.
```{r}
votes %>% head(10)
```
We see that the dataset is with 508,929 observations already somewhat large. We find a set of 4 variables. Doesn't sound so exiting on first glance. Anyhow, lets talk for a moment about the variables we have:
* **`rcid:`** The roll-call ID, serving as an unique identifyer for the resolution vouted on.
* **`session:`** The number of the session of the annual UNGA, starting with 1 for the historically first meeting
* **`vote:`** A numerical code representing the country's choice of vote:
* 1 = Yes
* 2 = Abstain
* 3 = No
* 8 = Not present
* 9 = Not a member
* **`ccode:`** A numerical code to identify the country
So, to sum up:
* Every row contains a voting choice for a particular resolution on country level.
* Consequently, we will for every resulution have a single row for every country.
* All variables are coded numerically
## Step 3: Preprocessing
Whatever data you obtian, it usually needs a bit of manual "massage" to be ready for further analysis. Common steps here are
1. Check variable format (dates, strings, numeric etc.)
2. Check variable scaling (range, outliers)
3. Deal with missing values
While there are some more issues you could run into, we in the following go through a couple of common problems and see some easy ways how to solve them.
### Filtering
One step of data cleaning is removing observations (rows) that you're not interested in. In this case, you want to remove "Not present" and "Not a member". Since we are for now not interested in them, lets get rid of them
```{r}
votes %>% filter(vote <= 3) %>% head(10)
```
### Adding a year column
The next step of data cleaning is manipulating your variables (columns) to make them more informative. In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.
```{r}
votes %>%
mutate(year = session + 1945) %>% head(10)
```
### Adding a country column
The country codes in the ccode column are what's called [Correlates of War codes](http://cow.dss.ucdavis.edu/data-sets/cow-country-codes), a numerical coding convention for country names. This isn't ideal for an analysis, since you'd like to work with recognizable country names.
For this and similar cases, the R package [`countrycode`](https://github.com/vincentarelbundock/countrycode) is really practical, since it allows for fast and easy ways to switch between many standard country-coding conventions.
```{r}
# # In case the package is not installed yet, do that first:
# install.packages("countrycode")
library(countrycode)
votes %>%
mutate(country = countrycode(ccode, "cown", "country.name")) %>% head(10)
```
Here,we tell `countrycode` to transfer the Correlates of War Code ("cown") to the easily readable country name ("country.name"). Other transformations such as to ISO2 and ISO3 alphanumerical codes are also possible. Neath, isn't it?
Note that we up to now never changed the original `votes` data. All pipes upto now just created an output for illustration. To really change the data, we have to use the assign-and-pipe `%<>%`, or the manual assignment `<-`. So, lets finish this cleaning up and pipe the whole preprocessing all together. Since not all counrty codes matched a country name, we will also filter the unmatchedc out
```{r}
votes %<>%
filter(vote <= 3) %>%
mutate(year = session + 1945,
country = countrycode(ccode, "cown", "country.name")) %>%
filter(!is.na(country))
votes %>% head(10)
```
## Generating first insights
###Summarizing the full dataset
In this analysis, we're going to focus on "% of votes that are yes" as a metric for the "agreeableness" of countries. we'll start by finding this summary for the entire dataset: the fraction of all votes in their history that were "yes". Note that within your call to `summarize()`, you can use `n()` to find the total number of votes and `mean(vote == 1)` (since the output of the logical operation is FALSE or TRUE, which is equivalent to 0 or 1) to find the fraction of "yes" votes.
```{r}
votes %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
```
Ok, we see that in general, countries tend to be "aggreable". While nice to know, this is not terribly informative. As always, single numbers tell us little, trends, context, and comparison does usually provide more insights. So lets start with putting this number in perspective.
### Summarizing by year
The `summarize()` function is especially useful when used within groups. Here, it gives not a overal summary, but one for every group instance. For example, we might like to know how much the average "agreeableness" of countries changed from year to year. To examine this, you can use `group_by()` to perform your summary not for the entire dataset, but within each year.
```{r}
year <- votes %>%
arrange(year) %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
year %>% head(10)
```
On first galance, it looks like the overall "agreeableness" seems to increase over time.
### Summarizing by country
In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries. For further analysis, we wil create a new dataframe on country level.
```{r}
country <- votes %>%
group_by(country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
country %>% head(10)
```
Let's inspect this country level voting pattern by looking at the most and least agreable countries by arranging the dataframe.
```{r}
country %>%
arrange(desc(percent_yes)) %>% head(10)
country %>%
arrange(percent_yes) %>% head(10)
```
Here we have the top-10 agreable and non-agreable countries (note the use of `slice()`, one of the many advanced `dplyr` verbs). We see that Zanzibar is an absolutely non-agreeing country. Yet, we also see that they only participated in 2 votes. So, is that number reliable and informative? That is a classical example of how we might get misleading results when working with aggregated data without thorroughly investigating it upfront.
So, in conclusion: We might want to get rid of countries with very litte total votes.
```{r}
country %<>%
filter(total >= 100)
```
## First (petite) graphical exploration
Allright, lets go on with a little analysis of time trends. Here, we will do a bit more graphical exploration. Therefore, lets have a mini-introduction to graphical plotting. Since this will be
### Digression: The Grammar of graphics, and [`ggplot2`](https://ggplot2.tidyverse.org/)
ggplot2 can be thought of as a mini-language (domain-specific language) within the R language. It is an R implementation of [Wilkinson's Grammar of Graphics book](https://www.springer.com/gp/book/9780387245447). [A Layered Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.pdf) describes Hadley's implementation of these thoughts in the ggplot2's design. Conceptually, the main idea behind the Grammar of Graphics is that a statistical graphic is a mapping from variables to aesthetic attributes (x axis value, y axis value, color, shape, size) of geometric objects (points, line, bars). While the Grammar of Graphic contains more elements, we will focus in this brief intro in the two main ones, aestetics and geometries.
* **Aestetics:** Devine the "surface" of your plot, in terms of what has to be mapped (size, coplor) on the x and y (and potentially adittional) axes. Aesteticts are defined within the `aes()` function.
* **Geometries:** Visual elements you can see in the plot itself, such as bars, lines, and points. They are defined within various `geom_XYZ()` functions.
Basically, you define a surface grid and then plot something on top. We will talk about all of that in depth in later sessions, for now that's all you need to know to understand the following simple examples.
### Plotting by year
So, lets graphically explore the development of agreeability over time. The code below is very simple. We pass our data `year` to the `ggplot()` function, which here only contains the basic aestetics `aes()`. In this case, we would like to map the `year` on the `x`, and the value of the `percent_yes` variable on the `y` axis. That defines the plot's surface. Then, we use the `+` to add a geometric element, in this case a line-plot.
```{r}
library(ggplot2)
year %>%
ggplot(aes(x = year, y = percent_yes) ) +
geom_line()
```
Looks like the trend of increasing agreeability sort of peaked in the 80s.
### Summarizing by year and country
Ok, that was nice so far, even though not terribly interesting, since it's hard to interpret aggregated numbers of almost all the world's countries combined. More tangible insights we might get when zooming in a single country, or comparing trends over a set of countries. Therefore we need a bit of a different data structure. We now need one observation not for year or country, but for every country and year. Fortunatelly, that's easily done via multiple groupings.
```{r}
year_country <- votes %>%
group_by(year, country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
year_country %>% head(10)
```
Ok, with this datastructure, we can do some interesting analysis. For example, we can only look at the development of Danish votes by filtering.
```{r}
year_country %>%
filter(country == "Denmark") %>%
ggplot(aes(x = year, y = percent_yes) ) +
geom_line()
```
#### Digression: The `%in%` operator
In case we want to compare a set of countries now, it would be convenient if we could select many of them in an easy way. Here, the `%in%` operator is handy. It basically just takes two vectors `x` and `y`, and for every element in `x` returns a Logical indication of this element is also contained in `y`. Example:
```{r}
c("the", "cake", "is", "a", "lie") %in% c("I", "would", "love", "to", "eat", "a", "cake")
```
We can following the same idea create a vector of country anmes we want to compare, and then filter our dataset for these countries.
```{r}
countries <- c("United States", "China", "France", "Denmark")
year_country %>%
filter(country %in% countries) %>% head(10)
```
So,l lets plot these countries. To have them all displayed in one plot, we could either create a `geom_line()` for every subset (would be silly), or define in ggplot's `aes()` that the categorical variable `country` should be mapped with different colors. Therefore, for every country, a sepperate line in a different color will be created.
```{r}
year_country %>%
filter(country %in% countries) %>%
ggplot(aes(year, percent_yes, color = country)) +
geom_line()
```
We indeed see differences between countries. While denmark seens to be somewhat stable around 50% agreement and disagreement (the Danish style...), the USA over time moved to a very low rate of agreement. China, which got its UN seat just in the 70s, appears to be the mosyt agreeable country in our small sample.
We could obviously go on with all kind of similar graphical analysis, but I do not want to spoiler too much of what is about to come in later sessions. So, lets leave it with that.
## Joining our data with contextual information
Up to now, we just looked at general voting pattern, and then a bit more nuanced at pattern over time and across countries. What the specific votes whewre about, we up to now did not really care. However, every vote corresponds to an important geo- and sociopolitic issue. So, it's time to bring in a bit of context. Therefore, we will now also draw from an adittional dataset with more detailed information on the content of the corresponding vote, which we up to now treated as pure numbers. Lets take a look:
```{r}
descriptions <- readRDS("data/UN_votes_descriptions.rds")
descriptions %>% head(10)
```