README.Rmd

---
title: "nanny: A tidyverse machine-learning suite"
output: github_document
---
**It tidies up your playground!**

#  <img src="inst/logo-02.png" height="139px" width="120px" />

<!---

[![Build Status](https://travis-ci.org/stemangiola/nanny.svg?branch=master)](https://travis-ci.org/stemangiola/nanny) [![Coverage Status](https://coveralls.io/repos/github/stemangiola/nanny/badge.svg?branch=master)](https://coveralls.io/github/stemangiola/nanny?branch=master)

-->

<!-- badges: start -->
  [![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
 <!-- badges: end -->

Please have a look also to 

- [tidygate](https://github.com/stemangiola/tidygate) for adding custom gate information to your tibble 
- [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) for heatmaps produced with tidy principles
- [tidybulk](https://github.com/stemangiola/tidybulk) brings transcriptomics to the tidyverse

## Functions/utilities available

It does a lot! cluster, PCA, permute, impute, rotate, redundancy-removal, triangular, smart-subset, identify abundant and variable features.

Function | Description
------------ | -------------
`reduce_dimensions` | Perform dimensionality reduction (PCA, MDS, tSNE)
`rotate_dimensions` | Rotate two dimensions of a degree
`cluster_elements` | Labels elements with cluster identity
`remove_redundancy` | Filter out elements with highly correlated features
`fill_missing` | Fill values of missing element/feature pairs 
`impute_missing` | Impute values of missing element/feature pairs 
`permute_nest` | From one column build a two permuted columns with nested information
`combine_nest` | From one column build a two combination columns with nested information
`keep_variable` | Keep top variable features
`lower_triangular` | keep rows corresponding to a lower triangular matrix

Utilities | Description
------------ | -------------
`as_matrix` | Robustly convert a tibble to matrix 
`subset`| Select columns with information relative to a column of interest

## Minimal input data frame

element | feature | value
------------ | ------------- | -------------
`chr` or `fctr` | `chr` or `fctr` | `numeric`

## Output data frame

element | feature | value | new information
------------ | ------------- | ------------- | -------------
`chr` or `fctr` | `chr` or `fctr` | `numeric` | ...


```{r, echo=FALSE, include=FALSE, }
library(knitr)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)

library(tidyverse)
library(magrittr)
library(nanny)

my_theme = 	
	theme_bw() +
	theme(
		panel.border = element_blank(),
		axis.line = element_line(),
		panel.grid.major = element_line(size = 0.2),
		panel.grid.minor = element_line(size = 0.1),
		text = element_text(size=12),
		legend.position="bottom",
		aspect.ratio=1,
		strip.background = element_blank(),
		axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
		axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
	)

```

## Installation

```{r, eval=FALSE}

devtools::install_github("stemangiola/nanny")

```

## Introduction

nanny is a collection of wrapper functions for high level data analysis and manipulation following the tidy paradigm.

## Tidy data

```{r}
mtcars_tidy = 
	mtcars %>% 
	as_tibble(rownames="car_model") %>% 
	mutate_at(vars(-car_model,- hp, -vs), scale) %>%
	gather(feature, value, -car_model, -hp, -vs)

mtcars_tidy
```


## `reduce_dimensions`

We may want to reduce the dimensions of our data, for example using PCA,  MDS of tSNE algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `element`, `feature` and `value`) and a method (e.g., MDS,  PCA or tSNE) as arguments and returns a tibble with additional columns for the reduced dimensions.

**MDS** 

```{r mds, cache=TRUE}
mtcars_tidy_MDS =
  mtcars_tidy %>%
  reduce_dimensions(car_model, feature, value, method="MDS", .dims = 3)

```

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

```{r plot_mds, cache=TRUE}
mtcars_tidy_MDS %>% subset(car_model)  %>% select(contains("Dim"), everything())

mtcars_tidy_MDS %>%
	subset(car_model) %>%
  GGally::ggpairs(columns = 4:6, ggplot2::aes(colour=factor(vs)))


```

**PCA**

```{r pca, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
mtcars_tidy_PCA =
  mtcars_tidy %>%
  reduce_dimensions(car_model, feature, value, method="PCA", .dims = 3)
```

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

```{r plot_pca, cache=TRUE}

mtcars_tidy_PCA %>% subset(car_model) %>% select(contains("PC"), everything())

mtcars_tidy_PCA %>%
	 subset(car_model) %>%
  GGally::ggpairs(columns = 4:6, ggplot2::aes(colour=factor(vs)))
```

**tSNE**

```{r tsne, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
mtcars_tidy_tSNE =
	mtcars_tidy %>% 
	reduce_dimensions(car_model, feature, value, method = "tSNE")
```

Plot

```{r}
mtcars_tidy_tSNE %>%
	subset(car_model) %>%
	select(contains("tSNE"), everything()) 

mtcars_tidy_tSNE %>%
	subset(car_model) %>%
	ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=factor(vs))) + geom_point() + my_theme
```

## `rotate_dimensions`

We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `element`, `feature` and `value`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as `<NAME OF DIMENSION> rotated <ANGLE>` by default, or as specified in the input arguments.

```{r rotate, cache=TRUE}
mtcars_tidy_MDS.rotated =
  mtcars_tidy_MDS %>%
	rotate_dimensions(`Dim1`, `Dim2`, .element = car_model, rotation_degrees = 45, action="get")
```

**Original**
On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.

```{r plot_rotate_1, cache=TRUE}
mtcars_tidy_MDS.rotated %>%
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs) )) +
  geom_point() +
  my_theme
```

**Rotated**
On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.

```{r plot_rotate_2, cache=TRUE}
mtcars_tidy_MDS.rotated %>%
	ggplot(aes(x=`Dim1 rotated 45`, y=`Dim2 rotated 45`, color=factor(vs) )) +
  geom_point() +
  my_theme
```


## `cluster_elements`

We may want to cluster our data (e.g., using k-means element-wise). `cluster_elements` takes as arguments a tibble, column names (as symbols; for `element`, `feature` and `value`) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.

**k-means**


```{r cluster, cache=TRUE}
mtcars_tidy_cluster = mtcars_tidy_MDS %>%
  cluster_elements(car_model, feature, value, method="kmeans",	centers = 2, action="get" )
```

We can add cluster annotation to the MDS dimension reduced data set and plot.

```{r plot_cluster, cache=TRUE}
 mtcars_tidy_cluster %>%
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=cluster_kmeans)) +
  geom_point() +
  my_theme
```

**SNN**

```{r SNN, cache=TRUE}
mtcars_tidy_SNN =
	mtcars_tidy_tSNE %>%
	cluster_elements(car_model, feature, value, method = "SNN")
```

We can add cluster annotation to the tSNE dimension reduced data set and plot.

```{r SNN_plot, cache=TRUE}
mtcars_tidy_SNN %>%
	subset(car_model) %>%
	select(contains("tSNE"), everything()) 

mtcars_tidy_SNN %>%
	subset(car_model) %>%
	ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=cluster_SNN)) + geom_point() + my_theme

```

**gating**

```{r eval=FALSE}
mtcars_tidy_MDS %>%
	cluster_elements(car_model, c(Dim1, Dim2), method="gate", .color=group)
```

![](inst/tidygate.gif)

```{r echo=FALSE}
mtcars_tidy_MDS %>%
	cluster_elements(car_model, c(Dim1, Dim2), method="gate", gate_list = gate_list) %>%
	arrange(gate %>% desc)
```

## `drop_redundant` 

We may want to remove redundant elements from the original data set (e.g., elements or features), for example if we want to define cell-type specific signatures with low element redundancy. `remove_redundancy` takes as arguments a tibble, column names (as symbols; for `element`, `feature` and `value`) and returns a tibble dropped recundant elements (e.g., elements). Two redundancy estimation approaches are supported:

removal of highly correlated clusters of elements (keeping a representative) with method="correlation"

```{r drop, cache=TRUE}
mtcars_tidy_non_redundant =
	mtcars_tidy_MDS %>%
  remove_redundancy(car_model, feature, value)
```

We can visualise how the reduced redundancy with the reduced dimentions look like

```{r plot_drop, cache=TRUE}
mtcars_tidy_non_redundant %>%
	subset(car_model) %>%
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs))) +
  geom_point() +
  my_theme

```


```{r drop2, cache=TRUE, include=FALSE, eval=FALSE, echo=FALSE}
mtcars_tidy_non_redundant =
	mtcars_tidy_MDS %>%
  remove_redundancy(
  	car_model, feature, value, 
  	method = "reduced_dimensions",
  	Dim_a_column = `Dim1`,
  	Dim_b_column = `Dim2`
  )
```

```{r plot_drop2, cache=TRUE, include=FALSE, eval=FALSE, echo=FALSE}
mtcars_tidy_non_redundant %>%
	subset(car_model) %>%
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs))) +
  geom_point() +
  my_theme

```


## `fill_missing` 

This function allows to obtain a rectangular underlying data structure, where every element has one feature, filling missing element/feature pairs with a value of choice (e.g., 0)

We create a non-rectangular data frame

```{r}
mtcars_tidy_non_rectangular = mtcars_tidy %>% slice(-1)
```

We fill the missing value with the value of 0

```{r}
mtcars_tidy_non_rectangular %>% fill_missing(car_model, feature, value, fill_with = 0)
```

## `impute_missing` 

This function allows to obtain a rectangular underlying data structure, where every element has one feature, imputig missing element/feature pairs with a function of choice (e.g., median)

We impute the missing value with the a summary value (median by default) according to a grouping

```{r}
mtcars_tidy_non_rectangular %>% mutate(vs = factor(vs)) %>% 
	impute_missing( car_model, feature, value,  ~ vs) %>%
	
	# Print imputed first
	arrange(car_model != "Mazda RX4" | feature != "mpg")
```

## `permute_nest`
From one column build a two permuted columns with nested information

```{r}
mtcars_tidy_permuted = 
	mtcars_tidy %>%
	permute_nest(car_model, c(feature,value))

mtcars_tidy_permuted
```

## `combine_nest`
From one column build a two combination columns with nested information

```{r}
mtcars_tidy %>%
	combine_nest(car_model, value)
```

## `lower_triangular` 
keep rows corresponding to a lower triangular matrix

```{r}
mtcars_tidy_permuted %>%
	
	# Summarise mpg
	mutate(data = map(data, ~ .x %>% filter(feature == "mpg") %>% summarise(mean(value)))) %>%
	unnest(data) %>%
	
	# Lower triangular
	lower_triangular(car_model_1, car_model_2,  `mean(value)`)
```

## `keep_variable`
Keep top variable features

```{r}
mtcars_tidy %>%
	keep_variable(car_model, feature, value, top=10)
```

## `as_matrix`
Robustly convert a tibble to matrix 

```{r}
mtcars_tidy %>%
	select(car_model, feature, value) %>%
	spread(feature, value) %>%
	as_matrix(rownames = car_model) %>%
	head()
```


## `subset` 
Select columns with information relative to a column of interest

```{r}
mtcars_tidy %>%
	subset(car_model)

```

## `nest_subset` 
Nest a data frame based on the columns with information relative to the column provided to nest

```{r}
mtcars_tidy %>% nest_subset(data = -car_model)

```

## `ADD` versus `GET` versus `ONLY` modes

Every function takes a tidyfeatureomics structured data as input, and (i) with action="add" outputs the new information joint to the original input data frame (default), (ii) with action="get" the new information with the element or feature relative informatin depending on what the analysis is about, or (iii) with action="only" just the new information. For example, from this data set

```{r, cache=TRUE}
  mtcars_tidy
```

**action="add"** (Default)
We can add the MDS dimensions to the original data set

```{r, cache=TRUE}
  mtcars_tidy %>%
    reduce_dimensions(
    	car_model, feature, value, 
    	method="MDS" ,
    	.dims = 3,
    	action="add"
    )
```

**action="get"** 
We can add the MDS dimensions to the original data set selecting just the element-wise column

```{r, cache=TRUE}
  mtcars_tidy %>%
    reduce_dimensions(
    	car_model, feature, value, 
    	method="MDS" ,
    	.dims = 3,
    	action="get"
    )
```

**action="only"**
We can get just the MDS dimensions relative to each element

```{r, cache=TRUE}
  mtcars_tidy %>%
    reduce_dimensions(
    	car_model, feature, value, 
    	method="MDS" ,
    	.dims = 3,
    	action="only"
    )
```