forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
22-de-real.Rmd
249 lines (215 loc) · 6.84 KB
/
22-de-real.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
---
# knit: bookdown::preview_chapter
output: html_document
---
# DE in a real dataset
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center")
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(scRNA.seq.funcs)
library(DESeq2)
library(scde)
library(ROCR)
library(limma)
set.seed(1)
```
## Introduction
The main advantage of using synthetic data is that we have full
control over all aspects of the data, and this facilitates the
interpretation of the results. However, the transcriptional bursting
model is unable to capture the full complexity of a real scRNA-seq
dataset. Next, we are going to analyze the difference between the
transcriptomes of the two stages of the mouse embryo development (_mid2cell_ and _16cell_) again from the `deng`. However, this time we need to use the raw counts instead of the normalized counts. This is a major requirement for the DE analysis tools like `DESeq2`, `edgeR`, `scde` etc.
```{r de-real-deng-fpkm}
deng <- readRDS("deng/deng_raw.rds")
# select cells at different stages
deng <- deng[ , c(which(colnames(deng) == "mid2cell"),
which(colnames(deng) == "16cell")[1:12])]
# keep those genes that are expressed in at least 6 cells
deng <- deng[rowSums(deng > 0) > 5, ]
pheatmap::pheatmap(
log2(deng + 1),
cutree_cols = 2,
kmeans_k = 100,
show_rownames = FALSE
)
```
As you can see, the cells cluster well by their developmental stage.
We can now use the same methods as before to obtain a list of
differentially expressed genes.
Because `scde` is very slow here we will only use a subset of genes. You should not do that with your real dataset, though. Here we do it just for demostration purposes:
```{r}
deng <- deng[sample(1:nrow(deng), 1000), ]
```
## KS-test
```{r, message=FALSE, warning=FALSE}
pVals <- rep(1, nrow(deng))
for (i in 1:nrow(deng)) {
res <- ks.test(
deng[i, 1:12],
deng[i, 13:24]
)
pVals[i] <- res$p.value
}
# Bonferroni correction
pVals <- p.adjust(pVals, method = "bonferroni")
```
## DESeq2
```{r, message=FALSE, warning=FALSE}
cond <- factor(
c(
rep("mid2cell", 12),
rep("16cell", 12)
)
)
colnames(deng) <- 1:ncol(deng)
dds <- DESeq2::DESeqDataSetFromMatrix(
deng,
colData = DataFrame(cond),
design = ~ cond
)
dds <- DESeq2::DESeq(dds)
resDESeq <- DESeq2::results(dds)
pValsDESeq <- resDESeq$padj
```
## SCDE
```{r}
cnts <- apply(
deng,
2,
function(x) {
storage.mode(x) <- 'integer'
return(x)
}
)
names(cond) <- 1:length(cond)
colnames(cnts) <- 1:length(cond)
o.ifm <- scde::scde.error.models(
counts = cnts,
groups = cond,
n.cores = 1,
threshold.segmentation = TRUE,
save.crossfit.plots = FALSE,
save.model.plots = FALSE,
verbose = 0,
min.size.entries = 2
)
priors <- scde::scde.expression.prior(
models = o.ifm,
counts = cnts,
length.out = 400,
show.plot = FALSE
)
resSCDE <- scde::scde.expression.difference(
o.ifm,
cnts,
priors,
groups = cond,
n.randomizations = 100,
n.cores = 1,
verbose = 0
)
# Convert Z-scores into 2-tailed p-values
pValsSCDE <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2
pValsSCDE <- p.adjust(pValsSCDE, method = "bonferroni")
```
## Comparison of the methods
```{r de-real-comparison}
vennDiagram(
vennCounts(
cbind(
pVals < 0.05,
pValsDESeq < 0.05,
pValsSCDE < 0.05
)
),
names = c("KS-test", "DESeq2", "SCDE"),
circle.col = c("red", "blue", "green")
)
```
__Exercise 1:__ How does this Venn diagram correspond to what you would expect based on the synthetic data?
## Visualisation
To further characterize the list of genes, we can calculate the
average fold-changes and compare the ones that were called as
differentially expressed to the ones that were not.
```{r de-real-deng-hist, fig.height=7}
group1 <- deng[, 1:12] # mid2cell stage
group2 <- deng[, 13:24] # 16cell stage
ksGenesChangedInds <- which(pVals<.05)
deSeqGenesChangedInds <- which(pValsDESeq<.05)
scdeGenesChangedInds <- which(pValsSCDE<.05)
ksGenesNotChangedInds <- which(pVals>=.05)
deSeqGenesNotChangedInds <- which(pValsDESeq>=.05)
scdeGenesNotChangedInds <- which(pValsSCDE>=.05)
meanFoldChange <- rowSums(group1)/rowSums(group2)
par(mfrow=c(2,1))
hist(log2(meanFoldChange[ksGenesChangedInds]),
freq = FALSE,
xlab = "# fold-change",
col = rgb(1, 0, 0, 1/4))
hist(log2(meanFoldChange[deSeqGenesChangedInds]),
freq = FALSE,
xlab = "# fold-change",
col = rgb(0, 0, 1, 1/4))
```
__Exercise 2:__ Create the histogram of fold-changes for SCDE. Compare
the estimated fold-changes between the different methods? What do the
genes where they differ have in common?
### Volcano plots
A popular method for illustrating the difference between two
conditions is the volcano plot which compares the magnitude of the
change and the significance.
```{r de-real-deng-volcano, fig.height=7}
par(mfrow=c(2,1))
plot(log2(meanFoldChange),
-log10(pVals),
xlab = "mean expression change",
ylab = "-log10(P-value), KS-test")
points(log2(meanFoldChange[ksGenesChangedInds]),
-log10(pVals[ksGenesChangedInds]),
col = "red")
plot(log2(meanFoldChange),
-log10(pValsDESeq),
xlab = "mean expression change",
ylab = "-log10(P-value), DESeq2")
points(log2(meanFoldChange[deSeqGenesChangedInds]),
-log10(pValsDESeq[deSeqGenesChangedInds]),
col = "blue")
```
### MA-plot
Another popular method for illustrating the difference between two
conditions is the MA-plot, in which the data has been transformed onto the M (log ratios) and A (mean average) scale:
```{r de-real-deng-ma-plot, fig.height=7}
par(mfrow=c(2,1))
plot(log2(rowMeans(group1)),
log2(meanFoldChange),
ylab = "mean fold change",
xlab = "mean expression")
points(log2(rowMeans(group1[ksGenesChangedInds,])),
log2(meanFoldChange[ksGenesChangedInds]),
col = "red")
plot(log2(rowMeans(group1)),
log2(meanFoldChange),
ylab = "mean fold change",
xlab = "mean expression")
points(log2(rowMeans(group1[deSeqGenesChangedInds,])),
log2(meanFoldChange[deSeqGenesChangedInds]),
col = "blue")
```
__Exercise 3:__ The volcano and MA-plots for the SCDE are missing - can
you generate them? Compare to the synthetic data, what do they tell
you about the properties of the genes that have changed?
### Heatmap of DE genes
Finally, we can plot heatmaps of the genes that were called as DE by
the intersection of the three.
```{r de-real-deng-heatmap}
allChangedInds <- intersect(which(pValsDESeq<.05),
intersect(which(pValsSCDE<.05),
which(pVals<.05)))
pheatmap::pheatmap(log2(1 + deng[allChangedInds,]),
cutree_cols = 2,
show_rownames = FALSE)
```
__Exercise 4:__ Create heatmaps for the genes that were detected by at least 2/3 methods.