-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy path08-MAW-PVII.Rmd
574 lines (387 loc) · 19.7 KB
/
08-MAW-PVII.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
---
title: "Differential abundance testing"
author: "Leo Lahti"
date: "`r Sys.Date()`"
output: bookdown::gitbook
site: bookdown::bookdown_site
---
```{r, echo=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
# Handle citations
require(knitcitations)
require(bookdown)
require(ggplot2)
# cleanbib()
# options("citation_format" = "pandoc")
bib <- read.bibtex("bibliography.bib")
#opts_chunk$set(fig.width=4, fig.height=3, par=TRUE, out.width='2in', fig.pos='H')
library(knitr)
knitr::opts_chunk$set(fig.path = "figure/", dev="CairoPNG")
theme_set(theme_bw(20))
```
# Differential abundance testing for univariate data
This section covers basic univariate tests for two-group comparison,
covering t-test, Wilcoxon test, and multiple testing. You can try out the suggested exercises in the hands-on session. These are followed by example solutions which we will cover in more detail in the class.
## Load example data
The following example compares the abundance of a selected bug between
two conditions. We assume that the data is already properly
normalized.
```{r univariate1, warning=FALSE, message=FALSE}
library(microbiome)
theme_set(theme_bw(20))
data(dietswap)
d <- dietswap
# Pick microbial abundances for a given taxonomic group
taxa <- "Dialister"
# Construct a data.frame with the selected
# taxonomic group and grouping
df <- data.frame(Abundance = abundances(d)[taxa,],
Group = meta(d)$nationality)
library(knitr)
kable(head(df))
```
## Visual comparison of two groups
**Task: Compare the groups visually** Tips: boxplot, density plot, histogram
Visualization of the absolute abundances is shown on the left. Let us try the log10 transformation. Now, the data contains many zeros and taking log10 will yield infinite values. Hence we choose the commonly used, although somewhat problematic, log10(1+x) transformation (right).
```{r univariate_boxplot, warning=FALSE, message=FALSE, fig.width=10, fig.height=5, out.width="600px"}
library(ggplot2)
p1 <- ggplot(df, aes(x = Group, y = Abundance)) +
geom_boxplot() +
labs(title = "Absolute abundances", y = "Abundance (read count)")
# Let us add the log10(1+x) version:
df$Log10_Abundance <- log10(1 + df$Abundance)
p2 <- ggplot(df, aes(x = Group, y = Log10_Abundance)) +
geom_boxplot() +
labs(title = "Log10 abundances", y = "Abundance (log10(1+x) read count)")
library(patchwork)
p1 + p2
```
## Statistical comparison of two groups
**Task: Test whether abundance differences are statistically significant between the two groups** Tips: t-test (t.test); Wilcoxon test (wilcox.test). Find information on how to use by typing help(t.test) or help(wilcox.test); or by looking for examples from the web.
The groups seem to differ.
First, let us perform the t-test. This is based on Gaussian assumptions. Each group is expected to follow Gaussian distribution.
Significance p-value with t-test:
```{r univariate2b, warning=FALSE, message=FALSE}
print(t.test(Log10_Abundance ~ Group, data = df)$p.value)
```
According to this, the abundances is not significantly different between the two groups (at $p<0.05$ level).
## Investigate assumptions of the t-test
**Task: Assess whether the abundance data is Gaussian or log-normal within each group** You can use for instance histogram (hist) or density plots (plot(density())).
Now let us investigate the Gaussian assumption of the t-test in more
detail. Let us try another visualization; the density plot.
```{r univariate_densityplot, warning=FALSE, message=FALSE}
p <- ggplot(df, aes(fill = Group, x = Log10_Abundance)) +
geom_density(alpha = 0.5)
print(p)
```
Apparently, the data is not even approximately Gaussian distributed. In such cases, a common procedure is to use non-parametric tests. These do not make assumptions of the data distribution but instead compare the ordering of the samples.
So, let us look at the significance p-value with Wilcoxon test (log10 data):
```{r univariate_wilcoxon, warning=FALSE, message=FALSE}
print(wilcox.test(Log10_Abundance ~ Group, data = df)$p.value)
```
But since the test is non-parametric, we can as well use the original absolute abundances; thelog transformation does not change sample ordering on which the Wilcoxon test is based.
Let us verify that the absolute abundances yield the same p-value for Wilcoxon test:
```{r univariate_wilcoxon2, warning=FALSE, message=FALSE}
print(wilcox.test(Abundance ~ Group, data = df)$p.value)
```
## Compare results between parametric and non-parametric tests
Let us compare how much the results would differ in the whole data
between t-test (parametric) and Wilcoxon test (non-parametric).To remove non-varying taxa that would demand extra scripting, let us for demonstration purposes now focus on core taxa that are observed in more than 20% of the samples with more than 3 reads.
```{r univariate4, warning=FALSE, message=FALSE}
# Core taxa to be tested
test.taxa <- core_members(d, prevalence = 20/100, detection = 3)
# Calculate p-values with the two different methods for each taxonomic unit
pvalue.ttest <- c()
pvalue.wilcoxon <- c()
for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
df <- data.frame(Abundance = abundances(d)[taxa,],
Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
pvalue.ttest[[taxa]] <- t.test(Log10_Abundance ~ Group, data = df)$p.value
pvalue.wilcoxon[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
}
# Arrange the results in a data.frame
pvalues <- data.frame(taxon = test.taxa,
pvalue.ttest = pvalue.ttest,
pvalue.wilcoxon = pvalue.wilcoxon)
# Note that multiple testing occurs.
# We must correct the p-values.
# let us apply the standard Benjamini-Hochberg False Discovery Rate (FDR)
# correction
pvalues$pvalue.ttest.adjusted <- p.adjust(pvalues$pvalue.ttest)
pvalues$pvalue.wilcoxon.adjusted <- p.adjust(pvalues$pvalue.wilcoxon)
```
Compare the distribution of raw and adjusteed p-values.
```{r univariate5, warning=FALSE, message=FALSE, fig.width=10, fig.height=5, out.width="600px"}
p1 <- ggplot(pvalues, aes(x = pvalue.wilcoxon)) +
geom_histogram() +
labs(title = "Raw p-values") +
ylim(c(0, 80))
p2 <- ggplot(pvalues, aes(x = pvalue.wilcoxon.adjusted)) +
geom_histogram() +
labs(title = "Adjusted p-values") +
ylim(c(0, 80))
print(p1 + p2)
```
Now compare these adjusted p-values between t-test and Wilcoxon test. Let us also highlight the p = 0.05 intervals.
```{r univariate6, warning=FALSE, message=FALSE, fig.width=8, fig.height=8, out.width="600px"}
p <- ggplot(data = pvalues,
aes(x = pvalue.ttest.adjusted,
y = pvalue.wilcoxon.adjusted)) +
geom_text(aes(label = taxon)) +
geom_abline(aes(intercept = 0, slope = 1)) +
geom_hline(aes(yintercept = 0.05), shape = 2) +
geom_vline(aes(xintercept = 0.05), shape = 2)
print(p)
```
# Linear models: the role of covariates
This section provides hands-on introduction to linear (and generalized linear) models.
**Task Fit linear model to compare abundance between the two groups.** You can use functions lm or glm, for instance.
## Fitting a linear model
Let us compare two groups with a linear model. We use Log10 abundances
since this is closer to the Gaussian assumptions than the absolute
count data. Fit a linear model with Gaussian variation as follows:
```{r lm, warning=FALSE, message=FALSE}
res <- glm(Log10_Abundance ~ Group, data = df, family = "gaussian")
```
## Interpreting linear model output
Investigate the model coefficients:
```{r lm_coefs, warning=FALSE, message=FALSE}
knitr::kable(summary(res)$coefficients, digits = 5)
```
The intercept equals to the mean in the first group:
```{r lm_coefs3, warning=FALSE, message=FALSE}
print(mean(subset(df, Group == "AAM")$Log10_Abundance))
```
The group term equals to the difference between group means:
```{r lm_coefs3b, warning=FALSE, message=FALSE}
print(mean(subset(df, Group == "AFR")$Log10_Abundance) -
mean(subset(df, Group == "AAM")$Log10_Abundance))
```
Note that the linear model (default) significance equals to t-test assuming equal variances.
```{r lm_signif, warning=FALSE, message=FALSE}
print(t.test(Log10_Abundance ~ Group, data = df, var.equal=TRUE)$p.value)
```
## Covariate testing
**Task: Investigate how sex and bmi affect the results. **
An important advantage of linear and generalized linear models, compared to plain t-test is that they allow incorporating additional variables, such as potential confounders (age, BMI, gender..):
```{r lm_glm_cov, warning=FALSE, message=FALSE}
# Add a covariate:
df$sex <- meta(d)$sex
df$bmi_group <- meta(d)$bmi_group
# Fit the model:
res <- glm(Log10_Abundance ~ Group + sex + bmi_group, data = df, family = "gaussian")
```
We can even include interaction terms:
```{r lm_glm, warning=FALSE, message=FALSE}
res <- glm(Log10_Abundance ~ Group * sex * bmi_group, data = df, family = "gaussian")
kable(coefficients(res))
```
For more examples on using and analysing linear models, see statmethods [regression](https://www.statmethods.net/stats/regression.html) and [ANOVA](See also [statmethods](https://www.statmethods.net/stats/anova.html) tutorials. **Try to adapt those examples on our microbiome example data data sets**.
# Advanced models for differential abundance
GLMs are the basis for advanced testing of differential abundance in
sequencing data. This is necessary, as the sequencing data sets
deviate from symmetric, continuous, Gaussian assumptions in many ways.
## Particular properties of taxonomic profiling data
### Discrete count data
Sequencing data consists of discrete counts:
```{r pooled2, warning=FALSE, message=FALSE}
print(abundances(d)[1:5,1:3])
```
### Sparsity
The data is sparse:
```{r pooled3, warning=FALSE, message=print, fig.width=8, fig.height=8}
hist(log10(1 + abundances(d)), 100)
```
### Rarity
Long tails of rare taxa:
```{r tail, warning=FALSE, message=print, fig.width=16, fig.height=8, out.width="600px"}
library(reshape2)
medians <- apply(abundances(d),1,median)/1e3
A <- melt(abundances(d))
A$Var1 <- factor(A$Var1, levels = rev(names(sort(medians))))
p <- ggplot(A, aes(x = Var1, y = value)) +
geom_boxplot() +
labs(y = "Abundance (reads)", x = "Taxonomic Group") +
scale_y_log10()
print(p)
```
### Overdispersion
Variance exceeds the mean:
```{r pooled_overdispersion, warning=FALSE, message=FALSE, fig.width=8, fig.height=8, out.width="600px"}
means <- apply(abundances(d),1,mean)
variances <- apply(abundances(d),1,var)
# Calculate mean and variance over samples for each taxon
library(reshape2)
library(dplyr)
df <- melt(abundances(d))
names(df) <- c("Taxon", "Sample", "Reads")
df <- df %>% group_by(Taxon) %>%
summarise(mean = mean(Reads),
variance = var(Reads))
# Illustrate overdispersion
library(scales)
p <- ggplot(df, aes(x = mean, y = variance)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
scale_x_log10(labels = scales::scientific) +
scale_y_log10(labels = scales::scientific) +
labs(title = "Overdispersion (variance > mean)")
print(p)
```
## Generalized linear models: a brief overview
Let us briefly discuss the ideas underlying generalized linear models.
The Generalized linear model (GLM) allows a richer family of
probability distributions to describe the data. Intuitively speaking,
GLMs allow the modeling of nonlinear, nonsymmetric, and nongaussian
associations. GLMs consist of three elements:
- A probability distribution for the data (from exponential family)
- A linear predictor targeting the mean, or expectation: $Xb$
- A link function g such that $E(Y) = \mu = g^{-1}(Xb)$.
Let us fit Poisson with (natural) log-link just to demonstrate how
generalized linear models could be fitted in R. We fit the abundance
(read counts) assuming that the data is Poisson distributed, and the
logarithm of its mean, or expectation, is obtained with a linear
model. For further examples in R, you can also check the [statmethods website](https://www.statmethods.net/advstats/glm.html).
```{r lm_glm2, warning=FALSE, message=FALSE}
# Load again the example data
d <- dietswap
df <- data.frame(Abundance = abundances(d)[taxa,],
Group = meta(d)$nationality)
res <- glm(Abundance ~ 1, data = df, family = "poisson")
```
Investigate the model output:
```{r lm_glm2b, warning=FALSE, message=FALSE}
knitr::kable(summary(res)$coefficients, digits = 5)
```
Note the link between mean and estimated coefficient ($\mu = e^{Xb}$):
```{r lm_glm2c, warning=FALSE, message=FALSE}
mean(df$Abundance)
exp(coef(res))
```
## DESeq2: differential abundance testing for sequencing data
### Fitting DESeq2
[DESeq2 analysis]((https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) accommodates those particular assumptions about
sequencing data.
```{r pooled_deseq, warning=FALSE, message=FALSE}
# Start by converting phyloseq object to deseq2 format
library(DESeq2)
d <- dietswap # Phyloseq data
ds2 <- phyloseq_to_deseq2(d, ~ group + nationality)
# Run DESeq2 analysis (all taxa at once!)
dds <- DESeq(ds2)
# Investigate results
deseq.results <- as.data.frame(results(dds))
deseq.results$taxon <- rownames(results(dds))
# Sort (arrange) by pvalue and effect size
library(knitr)
deseq.results <- deseq.results %>%
arrange(pvalue, log2FoldChange)
# Print the result table
# Let us only show significant hits
knitr::kable(deseq.results %>%
filter(pvalue < 0.05 & log2FoldChange > 1.5),
digits = 5)
```
### Comparison between DESeq2 and standard models
For comparison purposes, assess significances and effect sizes based on Wilcoxon test.
```{r pooled_topw, warning=FALSE, message=FALSE, fig.width=10, fig.height=5, out.width="200px"}
test.taxa <- taxa(d)
pvalue.wilcoxon <- c()
foldchange <- c()
for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
df <- data.frame(Abundance = abundances(d)[taxa,],
Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
# Calculate pvalue and effect size (difference beween log means)
pvalue.wilcoxon[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
foldchange[[taxa]] <- coef(lm(Log10_Abundance ~ Group, data = df))[[2]]
}
# Correct p-values for multiple testing
pvalue.wilcoxon.adjusted <- p.adjust(pvalue.wilcoxon)
```
```{r pooled_pcomp, warning=FALSE, message=FALSE, fig.width=10, fig.height=5}
par(mfrow = c(1,2))
plot(deseq.results$padj, pvalue.wilcoxon.adjusted,
xlab = "DESeq2 adjusted p-value",
ylab = "Wilcoxon adjusted p-value",
main = "P-value comparison")
abline(v = 0.05, h = 0.05, lty = 2)
plot(deseq.results$log2FoldChange, foldchange,
xlab = "DESeq2",
ylab = "Linear model",
main = "Effect size comparison")
abline(0,1)
```
For systematic comparisons between various methods for differential abundance testing, see [this paper](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0237-y).
# Multivariate comparisons of microbial community composition
The above examples focus on comparison per individual taxonomic group. Often, the groups are correlated and we are interested in comparing the overall community composition.
## PERMANOVA
Permutational multivariate analysis of variance [further reading](https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat07841). See also [statmethods](https://www.statmethods.net/stats/anova.html).
```{r}
library(vegan)
pseq <- dietswap
# Pick relative abundances (compositional) and sample metadata
pseq.rel <- microbiome::transform(pseq, "compositional")
otu <- abundances(pseq.rel)
meta <- meta(pseq.rel)
# samples x species as input
library(vegan)
permanova <- adonis(t(otu) ~ group,
data = meta, permutations=99, method = "bray")
# P-value
print(as.data.frame(permanova$aov.tab)["group", "Pr(>F)"])
```
## Checking the homogeneity condition
Type `?betadisper` in R console for more information.
```{r permanovacheck}
# Note the assumption of similar multivariate spread among the groups
# ie. analogous to variance homogeneity
# Here the groups have signif. different spreads and
# permanova result may be potentially explained by that.
dist <- vegdist(t(otu))
anova(betadisper(dist, meta$group))
permutest(betadisper(dist, meta$group), pairwise = TRUE)
```
We can also check which taxa contribute most to the community differences. Are these same or different compared to DESeq2?
```{r permanovatop, out.width="600px"}
coef <- coefficients(permanova)["group1",]
top.coef <- coef[rev(order(abs(coef)))[1:20]]
par(mar = c(3, 14, 2, 1))
barplot(sort(top.coef), horiz = T, las = 1, main = "Top taxa")
```
# Further exercises
When done with the differential abundance testing examples, you can investigate the use of the following standard methods for microbiome studies. We are available to discuss and explain the technical aspects in more detail during the class.
## Compositionality
**Compositionality effect** compare the effect of CLR transformation (microbiome::clr) on microbiome analysis results. 1) Compare t-test and/or Wilcoxon test results between data that is transformed with compositional or clr transformation (see the function microbiome::transform); and/or 2) Prepare PCoA with Bray-Curtis distances for compositional data; and PCoA with euclidean distances for CLR-transformed data (microbiome::transform). For examples, see [microbiome tutorial](http://microbiome.github.io/microbiome/Landscaping.html).
## Redundancy analysis (RDA)
A very good overview of various multivariate methods used in microbial ecology is provided [here](http://www.wright.edu/~oleg.paliy/Papers/Paliy_ME2016.pdf). Read the Redundancy analysis (and possibly other) section. Next, try to perform simple redundancy analysis in R based on the following examples.
Standard RDA for microbiota profiles versus the given (here 'time')
variable from sample metadata (see also the RDA method in
phyloseq::ordinate)
```{r rda2, warning=FALSE, message=FALSE, eval=FALSE}
x <- transform(dietswap, "compositional")
otu <- abundances(x)
metadata <- meta(x)
library(vegan)
rda.result <- vegan::rda(t(otu) ~ factor(metadata$nationality),
na.action = na.fail, scale = TRUE)
```
Visualize the standard RDA output:
```{r rda4, warning=FALSE, message=FALSE, fig.width=8, fig.height=8, eval=FALSE}
plot(rda.result, choices = c(1,2), type = "points", pch = 15, scaling = 3, cex = 0.7, col = metadata$time)
points(rda.result, choices = c(1,2), pch = 15, scaling = 3, cex = 0.7, col = metadata$time)
pl <- ordihull(rda.result, metadata$nationality, scaling = 3, label = TRUE)
```
Test RDA significance:
```{r rda2bc, warning=FALSE, message=FALSE, eval=FALSE}
permutest(rda.result)
```
Include confounding variables:
```{r rda3, warning=FALSE, message=FALSE, fig.width=8, fig.height=8, eval=FALSE}
rda.result2 <- vegan::rda(t(otu) ~ metadata$nationality + Condition(metadata$bmi_group + metadata$sex))
```
# Citation
If you found this book useful, please cite:
Shetty Sudarshan A, Lahti Leo, Hermes Gerben DA, & Hauke Smidt. (2018, September 27). Microbial bioinformatics introductory course material 2018 (Version 0.01). Zenodo. http://doi.org/10.5281/zenodo.1436630