README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# corrr <a href='https://corrr.tidymodels.org'><img src='man/figures/logo.png' align="right" height="139" /></a>

[![R build status](https://github.com/tidymodels/corrr/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/corrr/actions)
[![Build Status](https://travis-ci.org/tidymodels/corrr.svg?branch=master)](https://travis-ci.org/tidymodels/corrr)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/corrr)](https://cran.r-project.org/package=corrr)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/corrr/branch/master/graph/badge.svg)](https://codecov.io/gh/tidymodels/corrr?branch=master)

corrr is a package for exploring **corr**elations in **R**. It focuses on creating and working with **data frames** of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the [tidyverse](http://tidyverse.org/). This, along with the primary corrr functions, is represented below:

<img src='man/figures/to-cor-df.png'>

You can install:

- the latest released version from CRAN with

```{r install_cran, eval = FALSE}
# install.packages("corrr")
```

- the latest development version from GitHub with

```{r install_git, eval = FALSE}
# install.packages("remotes") 
# remotes::install_github("tidymodels/corrr")
```

## Using corrr

Using `corrr` typically starts with `correlate()`, which acts like the base correlation function `cor()`. It differs by defaulting to pairwise deletion, and returning a correlation data frame (`cor_df`) of the following structure:

- A `tbl` with an additional class, `cor_df`
- An extra "rowname" column
- Standardized variances (the matrix diagonal) set to missing values (`NA`) so they can be ignored.

### API

The corrr API is designed with data pipelines in mind (e.g., to use `%>%` from the magrittr package). After `correlate()`, the primary corrr functions take a `cor_df` as their first argument, and return a `cor_df` or `tbl` (or output like a plot). These functions serve one of three purposes:

Internal changes (`cor_df` out):

- `shave()` the upper or lower triangle (set to `r NA`).
- `rearrange()` the columns and rows based on correlation strengths.

Reshape structure (`tbl` or `cor_df` out):

- `focus()` on select columns and rows.
- `stretch()` into a long format.

Output/visualizations (console/plot out):

- `fashion()` the correlations for pretty printing.
- `rplot()` the correlations with shapes in place of the values.
- `network_plot()` the correlations in a network.

## Databases and Spark

The `correlate()` function also works with database tables.  The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the `cor_df` object.  This allows for those results integrate with the rest of the `corrr` API.

## Examples

```{r example, message = FALSE, warning = FALSE}
library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
x
```

As a `tbl`, we can use functions from data frame packages like `dplyr`, `tidyr`, `ggplot2`:

```{r, message = FALSE, warning = FALSE}
library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
```

corrr functions work in pipelines (`cor_df` in; `cor_df` or `tbl` out):

```{r combination, warning = FALSE, fig.height = 4, fig.width = 5}
x <- datasets::mtcars %>%
       correlate() %>%    # Create correlation data frame (cor_df)
       focus(-cyl, -vs, mirror = TRUE) %>%  # Focus on cor_df without 'cyl' and 'vs'
       rearrange() %>%  # rearrange by correlations
       shave() # Shave off the upper triangle for a clean result
       
fashion(x)
rplot(x)

datasets::airquality %>% 
  correlate() %>% 
  network_plot(min_cor = .2)
```