---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# corrr <a href='https://corrr.tidymodels.org'><img src='man/figures/logo.png' align="right" height="139" /></a>
[GitHub Actions](https://github.com/tidymodels/corrr/actions)
[Travis CI](https://travis-ci.org/tidymodels/corrr)
[CRAN](https://cran.r-project.org/package=corrr)
[Codecov](https://codecov.io/gh/tidymodels/corrr?branch=master)
corrr is a package for exploring **corr**elations in **R**. It focuses on creating and working with **data frames** of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the [tidyverse](http://tidyverse.org/). This, along with the primary corrr functions, is represented below:
<img src='man/figures/to-cor-df.png'>
You can install:
- the latest released version from CRAN with
```{r install_cran, eval = FALSE}
install.packages("corrr")
```
- the latest development version from GitHub with
```{r install_git, eval = FALSE}
# install.packages("remotes")
remotes::install_github("tidymodels/corrr")
```
## Using corrr
Using `corrr` typically starts with `correlate()`, which acts like the base correlation function `cor()`. It differs in that it defaults to pairwise deletion and returns a correlation data frame (`cor_df`) with the following structure:
- A `tbl` with an additional class, `cor_df`
- An extra "rowname" column
- Standardized variances (the matrix diagonal) set to missing values (`NA`) so they can be ignored.
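The difference from `cor()` can be seen on a small data set. The following is a minimal sketch, assuming corrr is installed; the data frame `d` is made up for illustration and is not part of the package.

```r
library(corrr)

# Illustrative data (hypothetical): one missing value in column `a`
d <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(2, 1, 4, 3, 6),
  c = c(5, 3, 4, 1, 2)
)

# Base R: with the default use = "everything", the correlations between
# `a` and the other columns are NA
cor(d)

# corrr: pairwise deletion by default, returned as a cor_df
x <- correlate(d, quiet = TRUE)
class(x)

# Off-diagonal values agree with cor() under pairwise deletion.
# x$b[3] is the correlation between `c` (third row) and `b`.
m <- cor(d, use = "pairwise.complete.obs")
all.equal(x$b[3], m["c", "b"])

# The diagonal is NA rather than 1
x$a[1]
```

Because the diagonal is `NA`, summaries such as a column mean with `na.rm = TRUE` skip the self-correlations without any extra filtering.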
### API
The corrr API is designed with data pipelines in mind (e.g., to use `%>%` from the magrittr package). After `correlate()`, the primary corrr functions take a `cor_df` as their first argument, and return a `cor_df` or `tbl` (or output like a plot). These functions serve one of three purposes:
Internal changes (`cor_df` out):
- `shave()` the upper or lower triangle (set to `r NA`).
- `rearrange()` the columns and rows based on correlation strengths.
Reshape structure (`tbl` or `cor_df` out):
- `focus()` on select columns and rows.
- `stretch()` into a long format.
Output/visualizations (console/plot out):
- `fashion()` the correlations for pretty printing.
- `rplot()` the correlations with shapes in place of the values.
- `network_plot()` the correlations in a network.
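As a rough illustration of the first two groups, the sketch below applies `shave()`, `focus()`, and `stretch()` to a handful of `mtcars` columns. The choice of columns is arbitrary, and the comments describe the behavior as I understand it from the package documentation.

```r
library(corrr)

# Correlate four columns of the built-in mtcars data
x <- correlate(mtcars[, c("mpg", "disp", "hp", "wt")], quiet = TRUE)

# Internal change: shave() returns a cor_df with the upper triangle set to NA
shaved <- shave(x)
class(shaved)

# Reshape: focus() keeps the selected column and drops it from the rows,
# leaving the other three variables as rows
focused <- focus(x, mpg)
focused

# Reshape: stretch() returns a long table with columns x, y and r;
# na.rm = TRUE drops the NA cells left by shave()
long <- stretch(shaved, na.rm = TRUE)
long
```

The long format returned by `stretch()` is usually the easiest one to pass on to `dplyr` or `ggplot2`, since each row holds one pair of variables and its correlation.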
## Databases and Spark
The `correlate()` function also works with database tables. The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the `cor_df` object. This allows those results to integrate with the rest of the `corrr` API.
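A minimal sketch of this workflow follows. It assumes the DBI, RSQLite, dbplyr, and dplyr packages are installed, and it uses an in-memory SQLite database as a stand-in for a real backend. `use = "complete.obs"` is passed explicitly because, to my knowledge, it is the only option the database method accepts; treat that as an assumption to check against your installed version.

```r
library(corrr)
library(dplyr)

# Assumption: an in-memory SQLite database stands in for a real backend
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db_mtcars <- copy_to(con, mtcars[, c("mpg", "disp", "hp", "wt")], name = "mtcars")

# db_mtcars is a remote table, not a local data frame
class(db_mtcars)

# The calculation is pushed to the database; only the results come back to R
db_cor <- correlate(db_mtcars, use = "complete.obs")
class(db_cor)

# The result is a local cor_df, so the rest of the API applies
db_long <- db_cor %>%
  shave() %>%
  stretch(na.rm = TRUE)
db_long

DBI::dbDisconnect(con)
```

Only the summary statistics leave the database, so the table itself never has to fit in local memory.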
## Examples
```{r example, message = FALSE, warning = FALSE}
library(MASS)
library(corrr)
set.seed(1)
# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))
# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA
# Correlate
x <- correlate(d)
class(x)
x
```
As a `tbl`, we can use functions from data frame packages like `dplyr`, `tidyr`, and `ggplot2`:
```{r, message = FALSE, warning = FALSE}
library(dplyr)
# Filter rows by correlation size
x %>% filter(v1 > .6)
```
corrr functions work in pipelines (`cor_df` in; `cor_df` or `tbl` out):
```{r combination, warning = FALSE, fig.height = 4, fig.width = 5}
x <- datasets::mtcars %>%
  correlate() %>%                       # Create correlation data frame (cor_df)
  focus(-cyl, -vs, mirror = TRUE) %>%   # Focus on cor_df without 'cyl' and 'vs'
  rearrange() %>%                       # Rearrange by correlations
  shave()                               # Shave off the upper triangle for a clean result
fashion(x)
rplot(x)
datasets::airquality %>%
  correlate() %>%
  network_plot(min_cor = .2)
```