vignettes/Gene-Relevance.Rmd

---
title: detecting relevant genes with destiny 3
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{detecting relevant genes with destiny 3}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Single Cell RNA-Sequencing data and gene relevance
==================================================

Libraries
---------

We need of course destiny, scran for preprocessing, and some tidyverse niceties.

```{r}
library(conflicted)
library(destiny)
suppressPackageStartupMessages(library(scran))
library(purrr)
library(ggplot2)
library(SingleCellExperiment)
```

Data
----

Let’s use data from the `scRNAseq`[1] package. If necessary, install it via `BiocManager::install('scRNAseq')`.

[1] Risso D, Cole M (2019). [scRNAseq: A Collection of Public Single-Cell RNA-Seq Datasets](https://bioconductor.org/packages/scRNAseq/).

```{r}
# The parts of the help we’re interested in
help('scRNAseq-package', package = 'scRNAseq') %>%
    repr::repr_html() %>%
    stringr::str_extract_all(stringr::regex('<p>The dataset.*?</p>', dotall = TRUE)) %>%
    unlist() %>%
    paste(collapse = '') %>%
    knitr::raw_html()
```

379 cells seems sufficient to see something!

```{r}
allen <- scRNAseq::ReprocessedAllenData()
```

Preprocessing
-------------

We’ll mostly stick to the [scran vignette][] here. Let’s add basic information to the data and choose what to work with.

As `scran` expects the raw counts in the `counts` assay, we rename the more accurate RSEM counts to `counts`.
Our data has ERCC spike-ins in an `altExp` slot:

[scran vignette]: https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html

```{r}
rowData(allen)$Symbol <- rownames(allen)
rowData(allen)$EntrezID <- AnnotationDbi::mapIds(org.Mm.eg.db::org.Mm.eg.db, rownames(allen), 'ENTREZID', 'ALIAS')
rowData(allen)$Uniprot <- AnnotationDbi::mapIds(org.Mm.eg.db::org.Mm.eg.db, rownames(allen), 'UNIPROT', 'ALIAS', multiVals = 'list')
assayNames(allen)[assayNames(allen) == 'rsem_counts'] <- 'counts'
assayNames(altExp(allen, 'ERCC'))[assayNames(altExp(allen, 'ERCC')) == 'rsem_counts'] <- 'counts'
allen
```

Now we can use it to renormalize the data. We normalize the `counts` using the spike-in size factors and logarithmize them into `logcounts`.

```{r}
allen <- computeSpikeFactors(allen, 'ERCC')
allen <- logNormCounts(allen)
allen
```

We also use the spike-ins to detect highly variable genes more accurately:

```{r}
decomp <- modelGeneVarWithSpikes(allen, 'ERCC')
rowData(allen)$hvg_order <- order(decomp$bio, decreasing = TRUE)
```

We create a subset of the data containing only rasonably highly variable genes:

```{r}
allen_hvg <- subset(allen, hvg_order <= 5000L)
```

Let’s create a Diffusion map. For rapid results, people often create a PCA first, which can be stored in your `SingleCellExperiment` before creating the Diffusion map or simply created implicitly using `DiffusionMap(..., n_pcs = <number>)`.

However, even with many more principal components than necessary to get a nicely resolved Diffusion Map, the close spatial correspondence between diffusion components and genes are lost.

```{r}
#To go from PCA: reducedDim(allen_hvg, 'pca') <- irlba::prcomp_irlba(t(assay(allen, 'logcounts')), 50)$x
```

The chosen distance metric has big implications on your results, you should try at least cosine and rankcor.

```{r}
set.seed(1)
dms <- c('euclidean', 'cosine', 'rankcor') %>% #, 'l2'
    set_names() %>%
    map(~ DiffusionMap(allen_hvg, distance = ., knn_params = list(method = 'covertree')))
```

```{r, fig.asp = 1/4, fig.width = 10}
dms %>%
    imap(function(dm, dist) plot(dm, 1:2, col_by = 'driver_1_s') + ggtitle(dist)) %>%
    cowplot::plot_grid(plotlist = ., nrow = 1)
```

```{r}
grs <- map(dms, gene_relevance)
```

```{r, fig.asp = 1/4, fig.width = 10}
gms <- imap(grs, function(gr, dist) plot(gr, iter_smooth = 0) + ggtitle(dist))
cowplot::plot_grid(plotlist = gms, nrow = 1)
```

As you can see, despite the quite different embedding, the rankcor and Cosine diffusion Maps display a number of the same driving genes.

```{r}
gms[-1] %>% map(~ .$ids[1:10]) %>% purrr::reduce(base::intersect) %>% cat(sep = ' ')
```

```{r}
options(readr.show_col_types = FALSE)
tryCatch({
    httr::GET('https://rest.uniprot.org/uniprotkb/search', query = list(
        fields = 'accession,gene_names,cc_tissue_specificity',
        format = 'tsv',
        query = rowData(allen)$Uniprot[gms$cosine$ids[1:6]] %>% unlist() %>% paste(collapse = ' OR ')
    )) %>%
        httr::content(type = 'text/tab-separated-values', encoding = 'utf-8')
}, error = function(e) e)
```