-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
425 lines (323 loc) · 21.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
devtools::load_all()
```
[](https://travis-ci.org/PavlidisLab/markerGeneProfile)[](https://codecov.io/gh/PavlidisLab/markerGeneProfile)
markerGeneProfile
==================
This package includes functions responsible for marker gene selection and marker gene profile estimation estimation as described in [Mancarci et al. 2017](http://www.eneuro.org/content/4/6/ENEURO.0212-17.2017). It also includes a copy of mouse brain cell type markers from the [neuroExpressoAnalysis](https://github.com/oganm/neuroExpressoAnalysis) package for convenience along with mock data for easy testing.
* [markerGeneProfile](#markergeneprofile)
* [Table of Contents](#table-of-contents)
* [Installation](#installation)
* [Usage](#usage)
* [Marker genes](#marker-genes)
* [Sample data for marker gene selection](#sample-data-for-marker-gene-selection)
* [Selection of marker genes](#selection-of-marker-genes)
* [Better selection of marker genes](#better-selection-of-marker-genes)
* [Marker gene profiles (MGP)](#marker-gene-profiles-mgp)
* [Sample data for MGP estimation](#sample-data-for-mgp-estimation)
* [MGP estimation](#mgp-estimation)
* [Note about homologene](#note-about-homologene)
Installation
=============
Use devtools to install. Additional github packages needs to be installed for it work.
```
devtools::install_github('oganm/markerGeneProfile')
```
In this document additional packages are used that are not package dependencies
```
install.packages('ggplot2')
install.packages('gplots')
install.packages('viridis')
install.packages('dplyr')
install.packages('knitr')
```
Usage
=============
Marker genes
------------------
A list of marker genes specific to or enriched in a cell type is required in order to estimate cell type profiles. In this package, a copy of mouse brain cell type-specific markers from [neuroExpressoAnalysis](https://github.com/oganm/neuroExpressoAnalysis), the package that summarizes the entire analysis performed in Mancarci et al. 2017 is included (`mouseMarkerGenes`). If an different marker gene set is needed, steps outlined below can be followed to create one from a new dataset.
### Sample data for marker gene selection
This package includes a sample cell type-specific transcriptomic dataset representing expression profiles from multiple purified cell types aimed to demonstrate the minimal information required for the selection of marker genes.
`mgp_sampleProfilesMeta` includes the basic metadata required for the cell type specific expression dataset.
```{r}
data(mgp_sampleProfilesMeta)
knitr::kable(head(mgp_sampleProfilesMeta))
```
**sampleName:** name of the samples. This needs to correspond to column names in the expression file.
**replicate:** A vector marking which samples are replicates of each other.
**PMID: ** A vector marking which samples come from the same study. Normally taking PMIDs of the papers is a good idea.
**CellType:** A vector marking the cell types that the samples represent.
**region:** The regions samples are extracted from. Only needed if region specific genes are to be selected.
**RegionToParent:** If region specific genes are to be selected and a region hierarchy is to be used, this column controls whether or not the sample should be included in the parent regions of the indicated region. If not provided it will default to `TRUE`. The name of this column is hard coded and should not be changed.
**RegionToChildren:** Same as above except it controls if the sample should be included in the children regions. If not provided it will default to `TRUE`. The name of this column is hard coded and should not be changed.
`mgp_sampleProfiles` is a sample expression data. **Gene.Symbol** column is the gene identifier that should be composed of unique IDs while the rest are sample names that corresponds to the relevant column in the metadata file. Other columns can be present before the sample data but they should not be of class `double`.
```{r}
data(mgp_sampleProfiles)
knitr::kable(mgp_sampleProfiles)
```
`mpg_sampleRegionHiearchy` is a sample region hiearchy. It is a nested named list.
```{r}
data(mpg_sampleRegionHiearchy)
mpg_sampleRegionHiearchy
nametree(mpg_sampleRegionHiearchy)
```
In this example `Region 1` and `Region 2` are subsets of `All` region
### Selection of marker genes
Marker gene selection is performed using three functions: `markerCandidates`, `pickMarkers` and `rotateSelect`. By default `markerCandidates` will return files for each cell type in a region that lists the gene that are above a given silhouette and fold change thresholds. Other variables are briefly explained below but see the package documentation for in depth explanations
```{r}
markerCandidates(design = mgp_sampleProfilesMeta, # the design file
expression = mgp_sampleProfiles, # expression file
outLoc = 'README_files/quickSelection', # output directory
groupNames = 'CellType', # name of the column with cell types. can be a vector
regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired
PMID = 'PMID', # name of the column with study identifiers
sampleName = 'sampleName', # name of the column with sample names
replicates = 'replicate', # name of the column with replicates
foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10)
minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8)
background = 6, # background level of expression (default is 6)
regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used
geneID = 'Gene.Symbol', # column name with with gene idenditifers
cores = 8 # number of cores to use in parallelization
)
```
This creates 3 directories in the output directory
```{r}
list.files('README_files/quickSelection')
```
The `CellType` directory is a list of marker genes that disregards all region specifications (redundant with `All_CellType` in this case) while `Region 2_CellType` and `All_CellType` directories inlcude cell types from the relevant region. Note the absence of `Region 1_CellType` since that region only has a single cell type.
```{r}
read.table('README_files/quickSelection/All_CellType/Cell C') %>% knitr::kable()
```
This file shows the candidate genes for cell type `Cell C` in region `All`. The first column is the gene identifier, the second is change in expression in log_2 scale and the third one is the silhouette coefficient. Note that `Gene6` is absent since its expression level was below the minimum level allowed. `markerCandidates` function does not apply a threshold for silhouette coefficient it also doesn't check to see if a gene satisfies fold change threshold for multiple genes. `pickMarkers` function does that.
```{r}
pickMarkers('README_files/quickSelection/All_CellType/',
foldChange = 1, # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant
silhouette = 0.5)
```
If all genes for all regions needs to be seen
```{r}
pickMarkersAll('README_files/quickSelection',
foldChange = 1,
silhouette = 0.5)
```
### Better selection of marker genes
The above method is a quick way to pick markers but it does not handle bimodality in expression distribution well. To ensure robustness of the results it is better to perform multiple selections with permutations. `markerCandidates` function has variables to handle permutations for you. `rotate` controls what is the percentage of samples that should be removed every time. seed controls the random seed and is there to ensure reproducibility.
```{r, results = 'hide'}
for (i in 1:10){
markerCandidates(design = mgp_sampleProfilesMeta, # the design file
expression = mgp_sampleProfiles, # expression file
outLoc = file.path('README_files/Rotation',i), # output directory
groupNames = 'CellType', # name of the column with cell types. can be a vector
regionNames = 'region', # name of the column with brain regions. leave NULL if no region seperation is desired
PMID = 'PMID', # name of the column with study identifiers
sampleName = 'sampleName', # name of the column with sample names
replicates = 'replicate', # name of the column with replicates
foldChangeThresh = 10, # threshold of fold change for gene selection (default is 10)
minimumExpression = 8, # minimum expression level that a gene can be considered a marker gene (default is 8)
background = 6, # background level of expression (default is 6)
regionHierarchy = mpg_sampleRegionHiearchy, # hierarchy of brain regions to be used
geneID = 'Gene.Symbol', # column name with with gene idenditifers
cores = 8, # number of cores to use in parallelization
rotate = 0.33,
seed = i
)
}
```
This creates multiple selection directories. `rotateSelect` can be used to count the number of times a gene is selected for each cell type in each region. This creates another directory similar to the output of `markerCandidates`. Again, valid markers can be acquired using `pickMarkers`
```{r, results= 'hide'}
rotateSelect(rotationOut='README_files/Rotation',
rotSelOut='README_files/RotSel',
cores = 8,
foldChange = 1 # this is a fixed fold change threshold that ignores some leniency that comes from markerCandidates. setting it to 1 makes it irrelevant
)
```
```{r}
pickMarkers('README_files/RotSel/All_CellType/',rotationThresh = 0.95)
pickMarkersAll('README_files/RotSel',rotationThresh = 0.95)
```
Marker gene profiles (MGP)
---------------------------
###Sample data for MGP estimation
The package includes mouse brain cell type markers published in Mancarci et al. 2017 and gene expression data from substantia nigra samples from healthy donors and Parkinson's disease patients by [Lesnick et al. 2007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7621).
Mouse marker genes is available in `mouseMarkerGenes` object as a nested list.
```{r}
data(mouseMarkerGenes)
names(mouseMarkerGenes)
lapply(mouseMarkerGenes$Midbrain[1:3],head, 14)
```
Available Lesnick et al. data is stored in `mgp_LesnickParkinsonsExp` and `mgp_LesnickParkinsonsMeta` objects
```{r}
library(dplyr)
data(mgp_LesnickParkinsonsExp)
mgp_LesnickParkinsonsExp %>%
dplyr::select(-GeneNames) %>%
head %>% {.[,1:6]}
```
```{r}
data(mgp_LesnickParkinsonsMeta)
mgp_LesnickParkinsonsMeta %>% head
```
Before MGP estimation, it is important to filter expression data of low expressed genes and make sure all genes are represented only once in the dataset. While there are many probeset summarization and methods, for this work we chose the most variable probeset and remove all probes with a median expression below the median expression of the dataset.
```{r}
unfilteredParkinsonsExp = mgp_LesnickParkinsonsExp # keep this for later
medExp = mgp_LesnickParkinsonsExp %>%
sepExpr() %>% {.[[2]]} %>%
unlist %>% median
# mostVariable function is part of this package that does probe selection and filtering for you
mgp_LesnickParkinsonsExp = mostVariable(mgp_LesnickParkinsonsExp,
threshold = medExp,
threshFun= median)
```
### MGP estimation
MGPs are simply the first principal component of marker gene expression. This method of marker
gene profile estimation is similar to the methodology of multiple previous works that aim to estimate relative
abundance of cell types in a whole tissue sample (Chikina et al., 2015; Westra et al., 2015; Xu et al., 2013).
Primary function that deals with MGP estimation is `mgpEstimate`. This function will take in expression data and marker gene lists to return MGPs. `exprData` is the expression matrix which should include gene identifiers as a column. Other columns can be present at the beginning of the data frame but should not be of class `double`. `genes` is the list of markers. It is assumed to be a list of character vectors, each vector containing gene names for a cell type. `geneColName` is the name of the column where gene names are found. `geneTransform` is a function that will be applied to the gene names provided in `genes`. Note that this by default tranforms mouse gene names to human gene names. Set it to NULL if this is not desired.
Below a basic estimation performed with other important variables briefly explained.
```{r}
estimations = mgpEstimate(exprData=mgp_LesnickParkinsonsExp,
genes=mouseMarkerGenes$Midbrain,
geneColName='Gene.Symbol',
outlierSampleRemove=F, # should outlier samples removed. This is done using boxplot stats.
geneTransform =function(x){homologene::mouse2human(x)$humanGene}, # this is the default option for geneTransform
groups=mgp_LesnickParkinsonsMeta$disease, #if there are experimental groups provide them here. if not desired set to NULL
seekConsensus = FALSE, # ensures gene rotations are positive in both of the groups
removeMinority = TRUE) # removes genes if they are the minority in terms of rotation sign from estimation process
```
Dopamine
rgic cell loss is a known effect of Parkinson's Disease. To see if this effect can be observed we can look at dopaminergic MGPs in healthy donors vs Parkinson's Disease patients
```{r}
library(ggplot2)
ls(estimations)
ls(estimations$estimates)
dopaminergicFrame =
data.frame(`Dopaminergic MGP` = estimations$estimates$Dopaminergic,
state = estimations$groups$Dopaminergic, # note that unless outlierSampleRemove is TRUE this will be always the same as the groups input
check.names=FALSE)
ggplot2::ggplot(dopaminergicFrame,
aes(x = state, y = `Dopaminergic MGP`)) +
geom_boxvio() + geom_jitter(width = .05) # this is just a convenience function that outputs a list of ggplot elements.
```
To see if the difference between the two groups is statistically significant we use wilcoxon test (Mann-Whitney test). Wilcoxon-test is a non parametric tests that does not make assumptions about the distribution of the data. We chose to use it because marker gene profiles are not normally distributed.
```{r}
group1 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "Control"]
group2 = estimations$estimates$Dopaminergic[estimations$groups$Dopaminergic %in% "PD"]
wilcox.test(group1,group2)
```
Based on these results we can say that there is indeed a significant difference between doparminergic marker gene profiles between control and parkinson's disease patients.
This difference can be also observed by looking at gene expression of markers
```{r,fig.height=8,fig.width=10}
estimations$usedMarkerExpression$Dopaminergic%>%
as.matrix %>%
gplots::heatmap.2(trace = 'none',
scale='row',Rowv = FALSE,Colv = FALSE, dendrogram = 'none',
col= viridis::viridis(10),cexRow = 1.5, cexCol = 1,
ColSideColors = estimations$groups$Dopaminergic %>%
toColor(palette = c('Control' = 'blue',
"PD" = "red")) %$% cols ,
margins = c(7,8))
```
Indeed in general most marker genes have a high expression is control samples which are shown in green.
It is important to note that not all marker genes are used in the estimation of marker gene profiles. Below we see all genes that are identified
as dopaminergic cell markers with human orthologues. Yet not all these genes appear in the heatmap above
```{r}
mouseHumanGeneTable = mouseMarkerGenes$Midbrain$Dopaminergic %>% homologene::mouse2human()
allHumanDopaGenes = mouseHumanGeneTable %$% humanGene
mouseHumanGeneTable
```
Some of these genes are removed because they are not included in our dataset, either because they are not in the platform, or because
the gene was filtered in the pre-processing stage due to low expression.
```{r}
allHumanDopaGenes[!allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol]
allGenes = allHumanDopaGenes[allHumanDopaGenes %in% mgp_LesnickParkinsonsExp$Gene.Symbol]
```
Other genes can be removed because mgpEstimate thinks they don't correlate well with the rest of the genes. Details of this process can be found in Mancarci et al. 2017 manuscript
```{r}
allGenes[!allGenes %in% rownames(estimations$usedMarkerExpression$Dopaminergic)]
genesUsed = rownames(estimations$usedMarkerExpression$Dopaminergic)
```
By looking at the unfiltered expression data, we can see how these two genes were unsuitable for MGP estimation
```{r,fig.height=8,fig.width=12}
toPlot =
unfilteredParkinsonsExp[unfilteredParkinsonsExp$Gene.Symbol %in%
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene,] %>%
mostVariable(threshold = 0)
toPlot %>%
mostVariable(threshold = 0) %>%
sepExpr() %>%
{.[[2]]} %>% as.matrix() %>%
apply(1,function(x){scale01(x)}) %>%
t %>%
{rownames(.) =
toPlot$Gene.Symbol[toPlot$Gene.Symbol %in%
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>%
reshape2::melt() %>%
{colnames(.) = c('Gene','Sample','Expression');.} %>%
dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>%
{.[Gene %in% 'CHRNA6'] = '(CHRNA6) not expressed';.[Gene %in% 'PRKCG'] = '(PRKCG) not correlated';.}) %>%
ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) +
geom_line() +
cowplot::theme_cowplot() +
theme( axis.text.x= element_blank(),
axis.title = element_text(size = 20),
legend.text = element_text(size = 17),
legend.title = element_text(size=22),
plot.title = element_text(size = 20)
) +
ggtitle('Scaled expression of markers')
toPlot %>%
mostVariable(threshold = 0) %>%
sepExpr() %>%
{.[[2]]} %>% as.matrix()%>%
{rownames(.) =
toPlot$Gene.Symbol[toPlot$Gene.Symbol %in%
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic)$humanGene];.} %>%
reshape2::melt() %>%
{colnames(.) = c('Gene','Sample','Expression');.} %>%
dplyr::mutate(`Is used?` = rep('used',length(Gene)) %>%
{.[Gene %in% 'CHRNA6'] = 'CHRNA6 - not expressed';.[Gene %in% 'PRKCG'] = 'PRKCG - not correlated';.}) %>%
ggplot(aes(y = Expression, x = Sample, group = Gene, color = `Is used?`)) +
geom_line() +
cowplot::theme_cowplot() +
theme( axis.text.x= element_blank(),
axis.title = element_text(size = 20),
legend.text = element_text(size = 17),
legend.title = element_text(size=22),
plot.title = element_text(size = 20)
)+
ggtitle('Nonscaled expression of markers')
```
Both of these genes seem to be non-expressed though PRKCG managed to be just above the removal threshold. Luckily lack of correlation reveals that it is not behaving as other marker genes.
The ratio of `genesUsed` and `allGenes` (all markers available in the study) can be used as a confidence metric. If a significant portion of the genes do not correlate well with each other
that may point to presence of non cell type specific signal (regulation or noise). For all cell types this ratio is outputted. You'll see a warning if this
ratio ever exceeds 0.4
```{r}
estimations$removedMarkerRatios
```
The ratio for dopaminergic cells is fairly low. We can also look at the amount of varience explained in the first PC of marker gene expression. If this value is low, it can point to higher amounts of confounding noise or regulation.
```{r}
estimations$trimmedPCAs$Dopaminergic %>% summary()
```
For instance unlike dopaminergic cells, cholinergic cells seem to have a higher proportion of their marker genes removed. Looking at the variation explained by first PC reveals that first PC only explains 28% of all variance.
```{r}
estimations$trimmedPCAs$BrainstemCholin %>% summary()
```
Such a result can occur if
* Marker genes are highly regulated.
* There is little difference in cell type proportions or wholescale regulation of marker genes and MGP is driven by noise and other confounds.
### Note about homologene
In this tutorial homologene database is used through [homologene](https://cran.r-project.org/web/packages/homologene/index.html) package. Note that the default version of this database is somewhat old which means some annotations may have been changed. The `homologene` package also includes an updated version of the homologene package to make sure you get all the correct matches.
If we look at dopaminergic cell markers for instance we could have gotten an extra
gene (SCN2A) if we used the updated version.
```{r}
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic) %$% humanGene
homologene::mouse2human(mouseMarkerGenes$Midbrain$Dopaminergic,
db = homologene::homologeneData2) %$% humanGene
```