Add plotRDA function to beta diversity chapter (#343)

* Add link to benchmarking and minor polish
* Simplify section on supervised ordination
* Add clarifications to DAA with confounding
* Fix beta diversity bug
* Minor change
* Add exercise on DAA method comparison
* Minor fix
* Minor fix
* Streamline RDA section with new plotRDA function
* Fix rmd table in beta diversity chapter
* Fix miaTime missing error
* Fix miaTime missing error
* Add dendextend to DESCRIPTION
* Add other missing deps
* Fix dep names
* Add multiassay analyses deps
* Remove reticulate from deps
* Add deps for extra materials
* Improve PCoA example
* Fix deployment
* Implement pseudocount = TRUE and minor fixes
* Update 30_differential_abundance (#348)
* Replace pseudocount 1 with TRUE throughout book
* Add table of typical beta div combinations
* Fix pseudocount bug
* Update beta diversity table

---------

Co-authored-by: Elina297 <[email protected]>
microbiome · Sep 27, 2023 · 1bcbabe
1 parent bcf0961
commit 1bcbabe

Showing 6 changed files with 99 additions and 126 deletions.
diff --git a/04_containers.Rmd b/04_containers.Rmd
@@ -68,7 +68,7 @@ Let us load example data and rename it as tse.
 
 ```{r}
 library(mia)
-data(hitchip1006, package="miaTime")
+data("hitchip1006", package = "miaTime")
 tse <- hitchip1006
 ```
 

diff --git a/20_beta_diversity.Rmd b/20_beta_diversity.Rmd
@@ -16,18 +16,29 @@ knitr::opts_chunk$set(
 
 # Community Similarity {#community-similarity}
 
-Whereas alpha diversity focuses on community variation within a community
-(one sample), beta diversity quantifies the dissimilarity between communities
-(multiple samples). In microbiome research, the most popular metrics of beta
+Beta diversity quantifies the dissimilarity between communities (multiple
+samples), as opposed to alpha diversity which focuses on variation within a
+community (one sample). In microbiome research, commonly used metrics of beta
 diversity include the Bray-Curtis index (for compositional data), Jaccard index
-(for presence / absence data, ignoring abundance information), Aitchison distance
+(for presence/absence data, ignoring abundance information), Aitchison distance
 (Euclidean distance for clr transformed abundances, aiming to avoid the
 compositionality bias), and the Unifrac distance (that takes into account the
 phylogenetic tree information). Notably, only some of these measures are actual
 _distances_, as this is a mathematical concept whose definition is not satisfied
 by certain ecological measure, such as the Bray-Curtis index. Therefore, the terms
 dissimilarity and beta diversity are preferred.
 
+| Method description          | Assay type          | Beta diversity metric |
+|:---------------------------:|:-------------------:|:---------------------:|
+| Quantitative profiling      | Absolute counts     | Bray-Curtis           | 
+| Relative profiling          | Relative abundances | Bray-Curtis           |
+| Aitchison distance          | Absolute counts     | Aitchison             |
+| Aitchison distance          | clr                 | Euclidean             |
+| Robust Aitchison distance   | rclr                | Euclidean             |
+| Presence/Absence similarity | Relative abundances | Jaccard               |
+| Presence/Absence similarity | Absolute counts     | Jaccard               |
+| Phylogenetic distance       | Rarefied counts     | Unifrac               |
+
 In practice, beta diversity is usually represented as a `dist` object, a
 triangular matrix where the distance between each pair of samples is encoded by
 a specific cell. This distance matrix can then undergo ordination, which is an
@@ -47,23 +58,6 @@ Reduction (UMAP), whereas the latter is mainly represented by distance-based
 Redundancy Analysis (dbRDA). We will first discuss unsupervised ordination
 methods and then proceed to supervised ones.
 
-To run the examples in this chapter, the following packages should be imported:
-
-* mia: microbiome analysis framework
-* scater: plotting reduced dimensions
-* vegan: ecological distances
-* ggplot2: plotting
-* patchwork: combining plots
-* dplyr: pipe operator
-
-```{r betadiv-packages, include = FALSE}
-library(mia)
-library(scater)
-library(vegan)
-library(ggplot2)
-library(patchwork)
-library(dplyr)
-```
 
 ## Unsupervised ordination {#unsupervised-ordination}
 
@@ -75,16 +69,22 @@ demonstration we will analyse beta diversity in GlobalPatterns, and observe the
 variation between stool samples and those with a different origin.
 
 ```{r prep-tse}
-# Example data
+# Load mia and import sample dataset
+library(mia)
 data("GlobalPatterns", package = "mia")
-
-# Data matrix (features x samples)
 tse <- GlobalPatterns
 
-# some beta diversity metrics are usually applied to relative abundances
+# Beta diversity metrics like Bray-Curtis are often applied to relabundances
 tse <- transformAssay(tse,
+                      assay.type = "counts",
                       method = "relabundance")
 
+# Other metrics like Aitchison to clr-transformed data
+tse <- transformAssay(tse,
+                      assay.type = "relabundance",
+                      method = "clr",
+                      pseudocount = TRUE)
+
 # Add group information Feces yes/no
 tse$Group <- tse$SampleType == "Feces"
 ```
@@ -106,12 +106,15 @@ dimensions via an ordination method, the results of which can be stored in the
 and `runNMDS` functions.
 
 ```{r runMDS}
-# Perform PCoA
+# Load package to plot reducedDim
+library(scater)
+
+# Run PCoA on relabundance assay with Bray-Curtis distances
 tse <- runMDS(tse,
               FUN = vegan::vegdist,
               method = "bray",
-              name = "PCoA_BC",
-              assay.type = "relabundance")
+              assay.type = "relabundance",
+              name = "MDS_bray")
 ```
 
 Sample dissimilarity can be visualized on a lower-dimensional display (typically
@@ -121,11 +124,11 @@ size and other aesthetics. Can you find any difference between the groups?
 
 ```{r plot-mds-bray-curtis, fig.cap = "MDS plot based on the Bray-Curtis distances on the GlobalPattern dataset."}
 # Create ggplot object
-p <- plotReducedDim(tse, "PCoA_BC",
+p <- plotReducedDim(tse, "MDS_bray",
                     colour_by = "Group")
 
 # Calculate explained variance
-e <- attr(reducedDim(tse, "PCoA_BC"), "eig")
+e <- attr(reducedDim(tse, "MDS_bray"), "eig")
 rel_eig <- e / sum(e[e > 0])
 
 # Add explained variance for each axis
@@ -135,32 +138,54 @@ p <- p + labs(x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep
 p
 ```
 
-With additional tools from the ggplot2 package, ordination methods can be 
-compared to find similarities between them or select the most suitable one to
-visualize beta diversity in the light of the research question.
+A few combinations of beta diversity metrics and assay types are typically
+used. For instance, Bray-Curtis dissimilarity and Euclidean distance are often
+applied to the relative abundance and the clr assays, respectively. Besides
+**beta diversity metric** and **assay type**, the **PCoA algorithm** is also a
+variable that should be considered. Below, we show how the choice of these three
+factors can affect the resulting lower-dimensional data.
+
+```{r mds-nmds-comparison, results='hide'}
+# Run NMDS on relabundance assay with Bray-Curtis distances
+tse <- runNMDS(tse,
+               FUN = vegan::vegdist,
+               method = "bray",
+               assay.type = "relabundance",
+               name = "NMDS_bray")
 
-```{r plot-mds-nmds-comparison, fig.cap = "Comparison of MDS and NMDS plots based on the Bray-Curtis or euclidean distances on the GlobalPattern dataset."}
+# Run MDS on clr assay with Aitchison distances
 tse <- runMDS(tse,
               FUN = vegan::vegdist,
-              name = "MDS_euclidean",
               method = "euclidean",
-              assay.type = "counts")
+              assay.type = "clr",
+              name = "MDS_aitchison")
 
+# Run NMDS on clr assay with Euclidean distances
 tse <- runNMDS(tse,
                FUN = vegan::vegdist,
-               name = "NMDS_BC")
+               method = "euclidean",
+               assay.type = "clr",
+               name = "NMDS_aitchison")
+```
 
-tse <- runNMDS(tse,
-               FUN = vegan::vegdist,
-               name = "NMDS_euclidean",
-               method = "euclidean")
+Multiple ordination plots are combined into a multi-panel plot with the
+patchwork package, so that different methods can be compared to find similarities
+between them or select the most suitable one to visualize beta diversity in the
+light of the research question.
+
+```{r, fig.cap = "Comparison of MDS and NMDS plots based on the Bray-Curtis or Aitchison distances on the GlobalPattern dataset."}
+# Load package for multi-panel plotting
+library(patchwork)
 
-plots <- lapply(c("PCoA_BC", "MDS_euclidean", "NMDS_BC", "NMDS_euclidean"),
+# Generate plots for all 4 reducedDims
+plots <- lapply(c("MDS_bray", "MDS_aitchison",
+                  "NMDS_bray", "NMDS_aitchison"),
                 plotReducedDim,
                 object = tse,
                 colour_by = "Group")
 
-((plots[[1]] | plots[[2]]) / (plots[[3]] | plots[[4]])) +
+# Generate multi-panel plot
+wrap_plots(plots) +
   plot_layout(guides = "collect")
 ```
 
@@ -169,16 +194,14 @@ relationship of features in form on a `phylo` tree. `calculateUnifrac`
 performs the calculation to return a `dist` object, which can again be
 used within `runMDS`.
 
-```{r}
+```{r plot-unifrac, fig.cap = "Unifrac distances scaled by MDS of the GlobalPattern dataset."}
 tse <- runMDS(tse,
               FUN = mia::calculateUnifrac,
               name = "Unifrac",
               tree = rowTree(tse),
               ntop = nrow(tse),
               assay.type = "counts")
-```
 
-```{r plot-unifrac, fig.cap = "Unifrac distances scaled by MDS of the GlobalPattern dataset."}
 plotReducedDim(tse, "Unifrac",
                colour_by = "Group")
 ```
@@ -240,6 +263,9 @@ would report relative stress, which varies in the unit interval and is better
 if smaller. This can be calculated as shown below.
 
 ```{r relstress}
+# Load vegan package
+library(vegan)
+
 # Quantify dissimilarities in the original feature space
 x <- assay(tse, "relabundance") # Pick relabunance assay separately
 d0 <- as.matrix(vegdist(t(x), "bray"))
@@ -282,10 +308,10 @@ them. The result shows how much each covariate affects beta diversity. The table
 below illustrates the relation between supervised and unsupervised ordination
 methods.
 
-|                           | supervised ordination  | unsupervised ordination  |
-|:-------------------------:|:----------------------:|:------------------------:|
-| Euclidean distance        | RDA                    | PCA                      |
-| non-Euclidean distance    | dbRDA                  | PCoA                     |
+|                          | supervised ordination  | unsupervised ordination  |
+|:------------------------:|:----------------------:|:------------------------:|
+| Euclidean distance       | RDA                    | PCA                      |
+| non-Euclidean distance   | dbRDA                  | PCoA/MDS, NMDS and UMAP  |
 
 We demonstrate the usage of dbRDA with the enterotype dataset, where samples
 correspond to patients. The colData contains the clinical status of each patient
@@ -325,7 +351,7 @@ function. We see that both clinical status and age explain more than 10% of the
 variance, but only age shows statistical significance.
 
 ```{r rda-permanova-res}
-rda_info$permanova %>%
+rda_info$permanova |>
   knitr::kable()
 ```
 
@@ -334,79 +360,20 @@ information from the results of RDA. In this case, none of the p-values is lower
 than the significance threshold, and thus homogeneity is observed.
 
 ```{r rda-homogeneity-res}
-rda_info$homogeneity %>%
+rda_info$homogeneity |>
   knitr::kable()
 ```
 
 Next, we proceed to visualize the weight and significance of each variable on
 the similarity between samples with an RDA plot, which can be generated with
-the following custom function.
+the `plotRDA` function from the miaViz package.
 
 ```{r plot-rda}
 # Load packages for plotting function
-library(stringr)
-library(ggord)
-
-rda <- attr(reducedDim(tse2, "RDA"), "rda")
-
-# Covariates that are being analyzed
-variable_names <- c("ClinicalStatus", "Gender", "Age")
-
-# Since na.exclude was used, if there were rows missing information, they were 
-# dropped off. Subset coldata so that it matches with rda.
-coldata <- colData(tse2)[ rownames(rda$CCA$wa), ]
-
-# Adjust names
-# Get labels of vectors
-vec_lab_old <- rownames(rda$CCA$biplot)
-
-# Loop through vector labels
-vec_lab <- sapply(vec_lab_old, FUN = function(name){
-    # Get the variable name
-    variable_name <- variable_names[ str_detect(name, variable_names) ]
-    # If the vector label includes also group name
-    if( !any(name %in% variable_names) ){
-        # Get the group names
-        group_name <- unique( coldata[[variable_name]] )[ 
-        which( paste0(variable_name, unique( coldata[[variable_name]] )) == name ) ]
-        # Modify vector so that group is separated from variable name
-        new_name <- paste0(variable_name, " \U2012 ", group_name)
-    } else{
-        new_name <- name
-    }
-    # Add percentage how much this variable explains, and p-value
-    new_name <- expr(paste(!!new_name, " (", 
-                           !!format(round( rda_info$permanova[variable_name, "Explained variance"]*100, 1), nsmall = 1), 
-                           "%, ",italic("P"), " = ", 
-                           !!gsub("0\\.","\\.", format(round( rda_info$permanova[variable_name, "Pr(>F)"], 3), 
-                                                       nsmall = 3)), ")"))
-
-    return(new_name)
-})
-# Add names
-names(vec_lab) <- vec_lab_old
-
-# Create labels for axis
-xlab <- paste0("RDA1 (", format(round( rda$CCA$eig[[1]]/rda$CCA$tot.chi*100, 1), nsmall = 1 ), "%)")
-ylab <- paste0("RDA2 (", format(round( rda$CCA$eig[[2]]/rda$CCA$tot.chi*100, 1), nsmall = 1 ), "%)")
-
-# Create a plot        
-plot <- ggord(rda, grp_in = coldata[["ClinicalStatus"]], vec_lab = vec_lab,
-              alpha = 0.5,
-              size = 4, addsize = -4,
-              #ext= 0.7, 
-              txt = 3.5, repel = TRUE, 
-              #coord_fix = FALSE
-          ) + 
-    # Adjust titles and labels
-    guides(colour = guide_legend("ClinicalStatus"),
-           fill = guide_legend("ClinicalStatus"),
-           group = guide_legend("ClinicalStatus"),
-           shape = guide_legend("ClinicalStatus"),
-           x = guide_axis(xlab),
-           y = guide_axis(ylab)) +
-    theme( axis.title = element_text(size = 10) )
-plot
+library(miaViz)
+
+# Generate RDA plot coloured by clinical status
+plotRDA(tse2, "RDA", colour_by = "ClinicalStatus")
 ```
 
 From the plot above, we can see that only age significantly describes

diff --git a/23_multi-assay_analyses.Rmd b/23_multi-assay_analyses.Rmd
@@ -102,7 +102,7 @@ bacterium X is present, is the concentration of metabolite Y lower or higher"?
 # Agglomerate microbiome data at family level
 mae[[1]] <- mergeFeaturesByPrevalence(mae[[1]], rank = "Family")
 # Does log10 transform for microbiome data
-mae[[1]] <- transformAssay(mae[[1]], method = "log10", pseudocount = 1)
+mae[[1]] <- transformAssay(mae[[1]], method = "log10", pseudocount = TRUE)
 
 # Give unique names so that we do not have problems when we are creating a plot
 rownames(mae[[1]]) <- getTaxonomyLabels(mae[[1]])
@@ -193,8 +193,8 @@ mae[[2]] <- transformAssay(mae[[2]], assay.type = "nmr",
 
 # Transforming biomarker data with z-transform
 mae[[3]] <- transformAssay(mae[[3]], assay.type = "signals",
-                            MARGIN = "features",
-                            method = "z", pseudocount = 1)
+                           MARGIN = "features",
+                           method = "z", pseudocount = 1)
 
 # Removing assays no longer needed
 assay(mae[[1]], "counts") <- NULL

diff --git a/30_differential_abundance.Rmd b/30_differential_abundance.Rmd
@@ -422,7 +422,6 @@ zicoseq_res %>%
 ```{r plot-zicoseq}
 ## x-axis is the effect size: R2 * direction of coefficient
 ZicoSeq.plot(ZicoSeq.obj = zicoseq_out,
-             meta.dat = as.data.frame(colData(tse)),
              pvalue.type = 'p.adj.fdr')
 ```
 

diff --git a/97_extra_materials.Rmd b/97_extra_materials.Rmd
@@ -232,23 +232,20 @@ plot(posterior, par="Lambda", focus.cov = rownames(X)[c(2,4)])
 ## Interactive 3D Plots
 
 ```{r, message=FALSE, warning=FALSE}
-# Installing libraryd packages
+# Load libraries
 library(rgl)
 library(plotly)
 ```
 
 ```{r setup2, warning=FALSE, message=FALSE}
 library(knitr)
-library(rgl)
 knitr::knit_hooks$set(webgl = hook_webgl)
 ```
 
 
 In this section we make a 3D version of the earlier  Visualizing the most dominant genus on PCoA (see \@ref(quality-control)), with the help of the plotly [@Sievert2020].
 
 ```{r, message=FALSE, warning=FALSE}
-# Installing the package
-library(curatedMetagenomicData)
 # Importing necessary libraries
 library(curatedMetagenomicData)
 library(dplyr)