README.Rmd

---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
devtools::load_all()
```

[![Build Status](https://travis-ci.org/PavlidisLab/markerGeneProfile.svg?branch=master)](https://travis-ci.org/PavlidisLab/markerGeneProfile)[![codecov](https://codecov.io/gh/PavlidisLab/markerGeneProfile/branch/master/graph/badge.svg)](https://codecov.io/gh/PavlidisLab/markerGeneProfile)


markerGeneProfile
==================
This package includes functions responsible for marker gene selection and marker gene profile estimation estimation as described in [Mancarci et al. 2017](http://www.eneuro.org/content/4/6/ENEURO.0212-17.2017). It also includes a copy of mouse brain cell type markers from the [neuroExpressoAnalysis](https://github.com/oganm/neuroExpressoAnalysis) package for convenience along with mock data for easy testing.

   * [markerGeneProfile](#markergeneprofile)
   * [Table of Contents](#table-of-contents)
   * [Installation](#installation)
   * [Usage](#usage)
      * [Marker genes](#marker-genes)
         * [Sample data for marker gene selection](#sample-data-for-marker-gene-selection)
         * [Selection of marker genes](#selection-of-marker-genes)
         * [Better selection of marker genes](#better-selection-of-marker-genes)
      * [Marker gene profiles (MGP)](#marker-gene-profiles-mgp)
         * [Sample data for MGP estimation](#sample-data-for-mgp-estimation)
         * [MGP estimation](#mgp-estimation)
         * [Note about homologene](#note-about-homologene)

Installation
=============

Use devtools to install. Additional github packages needs to be installed for it work.
```
devtools::install_github('oganm/markerGeneProfile')
```

In this document additional packages are used that are not package dependencies

```
install.packages('ggplot2')
install.packages('gplots')
install.packages('viridis')
install.packages('dplyr')
install.packages('knitr')
```

Usage
=============

Marker genes
------------------
A list of marker genes specific to or enriched in a cell type is required in order to estimate cell type profiles. In this package, a copy of mouse brain cell type-specific markers from [neuroExpressoAnalysis](https://github.com/oganm/neuroExpressoAnalysis), the package that summarizes the entire analysis performed in Mancarci et al. 2017 is included (`mouseMarkerGenes`). If an different marker gene set is needed, steps outlined below can be followed to create one from a new dataset.

### Sample data for marker gene selection

This package includes a sample cell type-specific transcriptomic dataset representing expression profiles from multiple purified cell types aimed to demonstrate the minimal information required for the selection of marker genes.

`mgp_sampleProfilesMeta` includes the basic metadata required for the cell type specific expression dataset.

```{r}
data(mgp_sampleProfilesMeta)
knitr::kable(head(mgp_sampleProfilesMeta))
```

**sampleName:** name of the samples. This needs to correspond to column names in the expression file.

**replicate:** A vector marking which samples are replicates of each other.

**PMID: ** A vector marking which samples come from the same study. Normally taking PMIDs of the papers is a good idea.

**CellType:** A vector marking the cell types that the samples represent.

**region:** The regions samples are extracted from. Only needed if region specific genes are to be selected.

**RegionToParent:** If region specific genes are to be selected and a region hierarchy is to be used, this column controls whether or not the sample should be included in the parent regions of the indicated region. If not provided it will default to `TRUE`. The name of this column is hard coded and should not be changed.

**RegionToChildren:** Same as above except it controls if the sample should be included in the children regions. If not provided it will default to `TRUE`. The name of this column is hard coded and should not be changed.

`mgp_sampleProfiles` is a sample expression data. **Gene.Symbol** column is the gene identifier that should be composed of unique IDs while the rest are sample names that corresponds to the relevant column in the metadata file. Other columns can be present before the sample data but they should not be of class `double`.

```{r}
data(mgp_sampleProfiles)
knitr::kable(mgp_sampleProfiles)
```

`mpg_sampleRegionHiearchy` is a sample region hiearchy. It is a nested named list.

```{r}
data(mpg_sampleRegionHiearchy)
mpg_sampleRegionHiearchy

nametree(mpg_sampleRegionHiearchy)
```
In this example `Region 1` and `Region 2` are subsets of `All` region

### Selection of marker genes

Marker gene selection is performed using three functions: `markerCandidates`, `pickMarkers` and `rotateSelect`. By default `markerCandidates` will return files for each cell type in a region that lists the gene that are above a given silhouette and fold change thresholds. Other variables are briefly explained below but see the package documentation for in depth explanations

```{r}
markerCandidates(design = mgp_sampleProfilesMeta, # the design file
                 expression = mgp_sampleProfiles, # expression file 
                 outLoc = 'README_files/quickSelection', # output directory
                 groupNames = 'CellType', # name of the column with cell types. can be a vector
                 regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired
                 PMID = 'PMID', # name of the column with study identifiers
                 sampleName = 'sampleName', # name of the column with sample names
                 replicates = 'replicate', # name of the column with replicates
                 foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10)
                 minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8)
                 background = 6, # background level of expression (default is 6)
                 regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used
                 geneID = 'Gene.Symbol', # column name with with gene idenditifers
                 cores = 8 # number of cores to use in parallelization 
                 )
```
This creates 3 directories in the output directory 
```{r}
list.files('README_files/quickSelection')
```
The `CellType` directory is a list of marker genes that disregards all region specifications (redundant with `All_CellType` in this case) while `Region 2_CellType` and `All_CellType` directories inlcude cell types from the relevant region. Note the absence of `Region 1_CellType` since that region only has a single cell type. 

```{r}
read.table('README_files/quickSelection/All_CellType/Cell C') %>% knitr::kable()
```

This file shows the candidate genes for cell type `Cell C` in region `All`. The first column is the gene identifier, the second is change in expression in log_2 scale and the third one is the silhouette coefficient. Note that `Gene6` is absent since its expression level was below the minimum level allowed. `markerCandidates` function does not apply a threshold for silhouette coefficient it also doesn't check to see if a gene satisfies fold change threshold for multiple genes. `pickMarkers` function does that.

```{r}
pickMarkers('README_files/quickSelection/All_CellType/',
            foldChange = 1,  # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant
            silhouette = 0.5)
```
If all genes for all regions needs to be seen
```{r}
pickMarkersAll('README_files/quickSelection',
               foldChange = 1,
               silhouette = 0.5)

```


### Better selection of marker genes

The above method is a quick way to pick markers but it does not handle bimodality in expression distribution well. To ensure robustness of the results it is better to perform multiple selections with permutations. `markerCandidates` function has variables to handle permutations for you. `rotate` controls what is the percentage of samples that should be removed every time. seed controls the random seed and is there to ensure reproducibility.

```{r, results = 'hide'}
for (i in 1:10){
    markerCandidates(design = mgp_sampleProfilesMeta, # the design file
                     expression = mgp_sampleProfiles, # expression file 
                     outLoc = file.path('README_files/Rotation',i), # output directory
                     groupNames = 'CellType', # name of the column with cell types. can be a vector
                     regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired
                     PMID = 'PMID', # name of the column with study identifiers
                     sampleName = 'sampleName', # name of the column with sample names
                     replicates = 'replicate', # name of the column with replicates
                     foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10)
                     minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8)
                     background = 6, # background level of expression (default is 6)
                     regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used
                     geneID = 'Gene.Symbol', # column name with with gene idenditifers
                     cores = 8, # number of cores to use in parallelization 
                     rotate = 0.33,
                     seed = i
    )
}
```
This creates multiple selection directories. `rotateSelect` can be used to count the number of times a gene is selected for each cell type in each region. This creates another directory similar to the output of `markerCandidates`. Again, valid markers can be acquired using `pickMarkers`

```{r, results= 'hide'}
rotateSelect(rotationOut='README_files/Rotation',
                 rotSelOut='README_files/RotSel',
                 cores = 8,
                 foldChange = 1 # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant
             )
```

```{r}
pickMarkers('README_files/RotSel/All_CellType/',rotationThresh = 0.95)
pickMarkersAll('README_files/RotSel',rotationThresh = 0.95)
```

Marker gene profiles (MGP)
---------------------------

###Sample data for MGP estimation


The package includes mouse brain cell type markers published in Mancarci et al. 2017 and gene expression data from substantia nigra samples from healthy donors and Parkinson's disease patients by [Lesnick et al. 2007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7621).

Mouse marker genes is available in `mouseMarkerGenes` object as a nested list.
```{r}
data(mouseMarkerGenes)
names(mouseMarkerGenes)

lapply(mouseMarkerGenes$Midbrain[1:3],head, 14)
```

Available Lesnick et al. data is stored in `mgp_LesnickParkinsonsExp` and `mgp_LesnickParkinsonsMeta` objects

```{r}
library(dplyr)

data(mgp_LesnickParkinsonsExp)
mgp_LesnickParkinsonsExp %>%
    dplyr::select(-GeneNames) %>%
    head %>% {.[,1:6]}
```


```{r}
data(mgp_LesnickParkinsonsMeta)
mgp_LesnickParkinsonsMeta %>% head
```


Before MGP estimation, it is important to filter expression data of low expressed genes and make sure all genes are represented only once in the dataset. While there are many probeset summarization and methods, for this work we chose the most variable probeset and remove all probes with a median expression below the median expression of the dataset.

```{r}
unfilteredParkinsonsExp = mgp_LesnickParkinsonsExp # keep this for later
medExp = mgp_LesnickParkinsonsExp %>% 
    sepExpr() %>% {.[[2]]} %>%
    unlist %>% median

# mostVariable function is part of this package that does probe selection and filtering for you
mgp_LesnickParkinsonsExp = mostVariable(mgp_LesnickParkinsonsExp,
                                        threshold = medExp, 
                                        threshFun= median)


```
### MGP estimation

MGPs are simply the first principal component of marker gene expression. This method of marker
gene profile estimation is similar to the methodology of multiple previous works that aim to estimate relative
abundance of cell types in a whole tissue sample (Chikina et al., 2015; Westra et al., 2015; Xu et al., 2013).

Primary function that deals with MGP estimation is `mgpEstimate`. This function will take in expression data and marker gene lists to return MGPs. `exprData` is the expression matrix which should include gene identifiers as a column. Other columns can be present at the beginning of the data frame but should not be of class `double`. `genes` is the list of markers. It is assumed to be a list of character vectors, each vector containing gene names for a cell type. `geneColName` is the name of the column where gene names are found. `geneTransform` is a function that will be applied to the gene names provided in `genes`. Note that this by default tranforms mouse gene names to human gene names. Set it to NULL if this is not desired.

Below a basic estimation performed with other important variables briefly explained.

```{r}
estimations =  mgpEstimate(exprData=mgp_LesnickParkinsonsExp,
                           genes=mouseMarkerGenes$Midbrain,
                           geneColName='Gene.Symbol',
                           outlierSampleRemove=F, # should outlier samples removed. This is done using boxplot stats.
                           geneTransform =function(x){homologene::mouse2human(x)$humanGene}, # this is the default option for geneTransform
                           groups=mgp_LesnickParkinsonsMeta$disease, #if there are experimental groups provide them here. if not desired set to NULL
                           seekConsensus = FALSE, # ensures gene rotations are positive in both of the groups
                           removeMinority = TRUE) # removes genes if they are the minority in terms of rotation sign from estimation process
```

Dopamine
rgic cell loss is a known effect of Parkinson's Disease. To see if this effect can be observed we can look at dopaminergic MGPs in healthy donors vs Parkinson's Disease patients

```{r}
library(ggplot2)

ls(estimations)

ls(estimations$estimates)

dopaminergicFrame =
    data.frame(`Dopaminergic MGP` = estimations$estimates$Dopaminergic, 
               state = estimations$groups$Dopaminergic, # note that unless outlierSampleRemove is TRUE this will be always the same as the groups input
               check.names=FALSE)

ggplot2::ggplot(dopaminergicFrame, 
                aes(x = state, y = `Dopaminergic MGP`)) + 
    geom_boxvio() + geom_jitter(width = .05) # this is just a convenience function that outputs a list of ggplot elements.
```

To see if the difference between the two groups is statistically significant we use wilcoxon test (Mann-Whitney test). Wilcoxon-test is a non parametric tests that does not make assumptions about the distribution of the data. We chose to use it because marker gene profiles are not normally distributed.

```{r}

group1 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "Control"]
group2 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "PD"]
wilcox.test(group1,group2)
```

Based on these results we can say that there is indeed a significant difference between doparminergic marker gene profiles between control and parkinson's disease patients.

This difference can be also observed by looking at gene expression of markers

```{r,fig.height=8,fig.width=10}
estimations$usedMarkerExpression$Dopaminergic%>%
    as.matrix %>%
    gplots::heatmap.2(trace = 'none',
                      scale='row',Rowv = FALSE,Colv = FALSE, dendrogram = 'none',
                      col= viridis::viridis(10),cexRow = 1.5, cexCol = 1,
                      ColSideColors = estimations$groups$Dopaminergic %>% 
                          toColor(palette = c('Control' = 'blue',
                                                     "PD" = "red")) %$% cols ,
                      margins = c(7,8))
```

Indeed in general most marker genes have a high expression is control samples which are shown in green.

It is important to note that not all marker genes are used in the estimation of marker gene profiles. Below we see all genes that are identified
as dopaminergic cell markers with human orthologues. Yet not all these genes appear in the heatmap above
```{r}
mouseHumanGeneTable = mouseMarkerGenes$Midbrain$Dopaminergic %>% homologene::mouse2human()
allHumanDopaGenes = mouseHumanGeneTable %$% humanGene
mouseHumanGeneTable
```

Some of these genes are removed because they are not included in our dataset, either because they are not in the platform, or because 
the gene was filtered in the pre-processing stage due to low expression.

```{r}
allHumanDopaGenes[!allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol]

allGenes = allHumanDopaGenes[allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol]
```

Other genes can be removed because mgpEstimate thinks they don't correlate well with the rest of the genes. Details of this process can be found in Mancarci et al. 2017 manuscript

```{r}
allGenes[!allGenes %in% rownames(estimations$usedMarkerExpression$Dopaminergic)]

genesUsed =  rownames(estimations$usedMarkerExpression$Dopaminergic)
```

By looking at the unfiltered expression data, we can see how these two genes were unsuitable for MGP estimation

```{r,fig.height=8,fig.width=12}
toPlot = 
    unfilteredParkinsonsExp[unfilteredParkinsonsExp$Gene.Symbol %in%
                                     homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene,] %>%
    mostVariable(threshold = 0) 

toPlot %>%
    mostVariable(threshold = 0) %>% 
    sepExpr() %>%
    {.[[2]]} %>% as.matrix() %>% 
    apply(1,function(x){scale01(x)}) %>%
    t %>%
    {rownames(.) = 
        toPlot$Gene.Symbol[toPlot$Gene.Symbol %in% 
                               homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>%
    reshape2::melt() %>% 
    {colnames(.) = c('Gene','Sample','Expression');.} %>% 
    dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>%
                      {.[Gene %in% 'CHRNA6'] = '(CHRNA6) not expressed';.[Gene %in% 'PRKCG'] = '(PRKCG) not correlated';.}) %>% 
    ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) + 
    geom_line() + 
    cowplot::theme_cowplot() +
    theme( axis.text.x= element_blank(),
           axis.title = element_text(size = 20),
           legend.text = element_text(size = 17),
           legend.title = element_text(size=22),
           plot.title = element_text(size = 20)
           ) + 
    ggtitle('Scaled expression of markers')

toPlot %>%
    mostVariable(threshold = 0) %>% 
    sepExpr() %>%
    {.[[2]]} %>% as.matrix()%>%
    {rownames(.) = 
        toPlot$Gene.Symbol[toPlot$Gene.Symbol %in% 
                               homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>%
    reshape2::melt() %>% 
    {colnames(.) = c('Gene','Sample','Expression');.} %>% 
    dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>%
                      {.[Gene %in% 'CHRNA6'] = 'CHRNA6 - not expressed';.[Gene %in% 'PRKCG'] = 'PRKCG - not correlated';.}) %>% 
    ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) + 
    geom_line() + 
    cowplot::theme_cowplot() +
       theme( axis.text.x= element_blank(),
           axis.title = element_text(size = 20),
           legend.text = element_text(size = 17),
           legend.title = element_text(size=22),
           plot.title = element_text(size = 20)
           )+ 
    ggtitle('Nonscaled expression of markers')


```

Both of these genes seem to be non-expressed though PRKCG managed to be just above the removal threshold. Luckily lack of correlation reveals that it is not behaving as other marker genes.

The ratio of `genesUsed` and `allGenes` (all markers available in the study) can be used as a confidence metric. If a significant portion of the genes do not correlate well with each other
that may point to presence of non cell type specific signal (regulation or noise). For all cell types this ratio is outputted. You'll see a warning if this 
ratio ever exceeds 0.4

```{r}
estimations$removedMarkerRatios
```

The ratio for dopaminergic cells is fairly low. We can also look at the amount of varience explained in the first PC of marker gene expression. If this value is low, it can point to higher amounts of confounding noise or regulation.

```{r}
estimations$trimmedPCAs$Dopaminergic %>% summary()
```

For instance unlike dopaminergic cells, cholinergic cells seem to have a higher proportion of their marker genes removed. Looking at the variation explained by first PC reveals that first PC only explains 28% of all variance. 
```{r}
estimations$trimmedPCAs$BrainstemCholin %>% summary()
```

Such a result can occur if

* Marker genes are highly regulated.
* There is little difference in cell type proportions or wholescale regulation of marker genes and MGP is driven by noise and other confounds.

### Note about homologene

In this tutorial homologene database is used through [homologene](https://cran.r-project.org/web/packages/homologene/index.html) package. Note that the default version of this database is somewhat old which means some annotations may have been changed. The `homologene` package also includes an updated version of the homologene package to make sure you get all the correct matches.

If we look at dopaminergic cell markers for instance we could have gotten an extra
gene (SCN2A) if we used the updated version.

```{r}
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic) %$% humanGene

homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic,
                        db = homologene::homologeneData2) %$% humanGene

```