Diversity and composition

Analysis of diversity and composition

Alpha diversity

Alpha diversity represents within-sample diversity. We can now make use of the Shannon diversity metric information we added to our phyloseq objects earlier to investigate whether there are changes based on some sample metadata of our choosing.

Alpha diversity measures and an introduction to Hill diversity indices are summarised nicely in the article "A conceptual guide to measuring species diversity".

Chao1 diversity is a measure of richness, and even the best asymptotic richness estimators (such as Chao1) cannot reliably predict the true community diversity. The problem lies in that richness measurements are strongly influenced by the rarest species, which are precisely the species that we know least about. Another way to put this is that richness has a high uncertainty.
The "traditional" diversity indices that explicitly include relative abundance, such as the Shannon and Simpson indices, are more robust than richness to the sampling problems that affect richness. However, these measures have their own units and do not scale intuitively (or even similarly) with species gain and loss.
Hill–Shannon diversity lies between richness and Hill–Simpson diversity, and may be the ‘just right' measure in many applications (Kempton 1979). The geometric mean affords leverage to extreme values according to their proportional, not absolute, difference from the mean. Thus, it can respond strongly to both very high and to very low rarity values. Another argument in favor of Hill–Shannon is that many species abundance distributions are approximately log-normal (Williamson and Gaston 2005, McGill et al. 2007), and thus their central tendency might be well described by the geometric mean.

The approach for this will be different depending on whether your metadata of interest is categorical or numerical, and how you want to split up your data (if at all – for example, you may have a numeric explanatory variable, but also a categorical grouping factor).

Example: longitudinal experiment with two groups

Perhaps we have a longitudinal study where faecal samples have been taken from the same animals at five distinct time points, and the animals belong to one of two treatment groups, e.g. a Sham group and a drug treatment group.

Firstly we may want to determine if there are changes from the baseline alpha diversity (timepoint 0). One way we could investigate that is as follows:

# Transform the sample data into a data.frame
bact_data_samples <- data.frame(sample_data(bact_data_filtered))

# Plot the diversity over time by group
(bact_diversity_plot <- ggplot(bact_data_samples, aes(x = days_postTx, y = Diversity)) +
    geom_path(aes(group = rat_id), col = 'grey70') + # draw a path between sample from the same animal
    geom_violin(aes(y = Diversity, fill = group), scale = 'area') +
    geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0.5, binwidth = 1/30 * diff(range(bact_data_samples$Diversity))) +
    geom_boxplot(aes(y = Diversity, fill = group), width = 0.2) +
    scale_fill_jama(alpha = 0.5, name = 'Treatment Group') +
    guides(fill = 'none') +
    labs(title = 'Bacterial Diversity (Shannon Index)',
         x = 'Time Since Treatment (days)') +
    facet_grid(cols = vars(group)) +
    stat_compare_means(comparisons = list(c('0', '7'),
                                          c('0', '14'),
                                          c('0', '21'),
                                          c('0', '28')))
)
ggsave(here::here('figures', 'diversity', 'bact_diversity_by_group.pdf'), bact_diversity_plot,
       width = 16, height = 12, units = 'cm')

That would produce this plot:

Alternatively, we may decide we want to look at whether the treatment affects alpha diversity between groups over time. For example, there may be some temporal element affecting alpha diversity that is independent of the treatment. This is easy to alter by changing the faceting variable, and removing the geom_path() element.

# Transform the sample data into a data.frame
bact_data_samples <- data.frame(sample_data(bact_data_filtered))

# Prepare facet labels (more descriptive than the numbers alone)
# We will use the labeller within facet_grid to use these labels
facet_col_labels <- c('0' = '0 days post-Tx',
                      '7' = '7 days post-Tx',
                      '14' = '14 days post-Tx',
                      '21' = '21 days post-Tx',
                      '28' = '28 days post-Tx')

# Plot the diversity over time between groups
(bact_diversity_plot2 <- ggplot(bact_data_samples, aes(x = group, y = Diversity)) +
    geom_violin(aes(y = Diversity, fill = group), scale = 'area') +
    geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0.5, 
                 binwidth = 1/30 * diff(range(bact_data_samples$Diversity))) +
    geom_boxplot(aes(y = Diversity, fill = group), width = 0.2) +
    scale_fill_jama(alpha = 0.5, name = 'Treatment Group') +
    guides(fill = 'none') +
    labs(title = 'Bacterial Diversity (Shannon Index)',
         x = 'Treatment Group') +
    facet_grid(cols = vars(days_postTx), scales = 'free', labeller = as_labeller(facet_col_labels)) +
    stat_compare_means(comparisons = list(c('Sham', 'DrugXYZ')))
)
ggsave(here::here('figures', 'diversity', 'bact_diversity_by_time.pdf'), bact_diversity_plot2,
       width = 16, height = 12, units = 'cm')

That would produce this plot:

TOP OF PAGE

We can also loop through all the available diversity metrics and produce a combined plot with ggarrange as follows:

# Plot the diversity metrics
diversity_plots <- list()
for (i in c('Chao1', 'Shannon', 'Simpson', 'Hill_Shannon_phylo', 'Hill_Shannon_taxa')) {
  p <- ggplot(bact_df_samples, aes(x = group, y = .data[[i]])) +
    geom_violin(aes(y = .data[[i]], fill = group), scale = 'area', trim = FALSE) +
    geom_boxplot(aes(y = .data[[i]], fill = group), width = 0.2) +
    geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 1) +
    scale_fill_jama(alpha = 0.5, name = 'Group') +
    scale_color_aaas() +
    labs(title = paste0(i, ' Index'),
         x = 'Group',
         y = paste0(i, ' Diversity Index')) +
    stat_compare_means(comparisons = list(c('Sham', 'DrugXYZ')), size = 3) +
    theme_pubr(legend = 'right')
  diversity_plots[[i]] <- p
}
p_arr <- ggarrange(plotlist = diversity_plots, nrow = 1, common.legend = TRUE, legend = 'none')
ggsave(here::here('figures', 'diversity', 'treatment_diversity_metrics.pdf'), p_arr,
       width = 35, height = 8, units = 'cm')

Beta diversity

Beta diversity represents between-sample diversity. Continuing on from our example above, because the taxonomic composition may vary greatly with time, if we were to plot ordination plots for all samples together, we may lose clarity and minimise resolution/separation of data points at each of the time points individually.

In this example, the data came from 16S rRNA amplicon sequencing, so we have access to the de novo phylogenetic tree, and can use the UniFrac distance metric. UniFrac distances make use of the tree, by assigning short distances to a pair of ASVs that are highly similar, and larger distances to those that are more dissimilar. If however we were using a phyloseq derived from shotgun metagenomic sequencing data, we could use another distance metric, e.g. Bray-Curtis ('bray').

To save time and code, let's loop through the ordinations at each timepoint by creating temporary phyloseq subsets, add their plotted ordinations to a plot list, and finally combine the plots into a single figure with ggarrange().

If you just have a single categorical variable however, you can make use of the plot code within the loop, i.e. p <- plot_ordination() etc.

# Choose parameters for downstream analyses
met <- 'PCoA'
dist <- 'unifrac'
bact <- bact_data_logCSS

# Bacterial ordination (separately for each temporal group)
ord_list <- list() # blank list to hold ordination plots
for (day in levels(bact@sam_data$days_postTx)) { # loop through the unique values of 'days_postTx' in the phyloseq sample_data()
  bact_tmp <- prune_samples(bact@sam_data$days_postTx == day, bact) # create temporary subsets of the phyloseq object
  
  p <- plot_ordination(bact_tmp, ordinate(bact_tmp, met, dist, weighted = TRUE), # ordinate via the phyloseq plot_ordination() function
                       title = paste0(day, ' days post-treatment')) +
    stat_ellipse(aes(fill = group), geom = 'polygon', type = 't', level = 0.95, alpha = 0.2) + # add group ellipses
    scale_shape_identity() +
    geom_point(aes(fill = group), shape = 21, size = 3) + # add the individual data points
    scale_fill_jama(name = 'Treatment Group') # scale the fill colour using ggsci scale_fill_jama() function
  
  ord_list[[day]] <- p # add the plot to the plot list
}

# Arrange ordination plots using ggpubr ggarrange() function
ord_plots <- ggarrange(plotlist = ord_list, nrow = 2, ncol = 3, common.legend = TRUE)
(ord_plots <- annotate_figure(ord_plots,
                              top = text_grob(label = paste0('Bacteria ', met, ' ', dist, ' ordination'))))
ggsave(here::here('figures', 'ordination', 'bact_PCoA_ordination_by_time.pdf'), ord_plots, # Save combined figure
       width = 20, height = 16, units = 'cm')

The resulting output file looks like this:

TOP OF PAGE

PERMANOVA of beta diversity

In addition to this, performing a PERMANOVA on your dissimilarity matrix will provide you with the relevant statistics for assessing whether there is significant differences between your groups – keep in mind that the ordination plot is simply a way of visualising differences between samples.

To do this, we will use the adonis2 function from the vegan package, after first generating our dissimilarity matrix using the package's vegdist.

Here we will show an example with Bray-Curtis distance, but if you were using 16S data and wanted to calculate the distance matrix with UniFrac distances, you could replace the vegdist function with the UniFrac function from the phyloseq package.
As far as the adonis2 function goes, the formula left-hand side takes a dissimilarity/distance matrix, and the right side takes the unquoted name(s) of columns from a data.frame (in this case the sample data object of our phyloseq object). As with other formulas in R, the right side can take multiple arguments, and can utilise both + and * notation.

# Generate the Bray-Curtis dissimilarity matrix (or the appropriate distance metric)
bact_bray_dist <- phyloseq::distance(bact_data_logCSS, method = 'bray')

# Run PERMANOVA
bact_bray_permanova <- adonis2(bact_bray_dist ~ group, data = data.frame(sample_data(bact_data_logCSS)), permutations = 1e4)

This will provide an output that looks something like this:

	Df	SumOfSqs	R2	F	Pr(>F)
timeframe	2	0.6863103	0.07398396	3.635218	0.001
Residual	91	8.5901631	0.92601604	NA	NA
Total	93	9.2764734	1.00000000	NA	NA

Taxonomic composition

To investigate sample taxonomic composition, we can either look at the average composition of groups (using some sample metadata variable) or at the composition of each individual sample.

In either case, we will first need to filter the taxa using the microbiome::core() function to increase the clarity of the resulting barplots. We will place that within the phyloseq::tax_glom() function – to agglomerate to the genus level for example – and then transform the data to compositional form, and use the phyloseq::psmelt() function to pivot the data longer for plotting with ggplot2.

Data preparation

# Filter taxa to increase clarity of barplots
bact_data_tax <- psmelt(microbiome::transform(tax_glom(core(bact_data_filtered, detection = 10, prevalence = 0.2), 
                                                       'Genus', NArm = T), transform = 'compositional', target = 'OTU'))

Next, we will prepare a colour vector for the barplots using the RColorBrewer package.

# Prepare colours for plotting
library(RColorBrewer)
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector <- unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))

TOP OF PAGE

Taxonomic barplot by group

We will first plot the "average" barplot using a grouping variable. This could be a treatment group, time frame, etc. In this example, we will continue on with the drug treatment example from above, using the treatment as our grouping variable and also facet by time frame (correcting the facet labels by passing the facet_col_labels vector we created above to the labeller argument of facet_grid()).

# Plot taxonomic composition by group over time
(bact_taxonomy_gen <- ggplot(bact_data_tax, aes(x = group, y = Abundance, fill = Genus, NArm = FALSE)) +
    geom_bar(stat = 'identity', position = 'fill', size = 0.01) +
    scale_fill_manual(values = col_vector) + # use the colour vector we created
    labs(title = 'Genera Distribution vs Treatment Group',
         x = 'Treatment Group',
         y = 'Abundance') +
    theme(legend.key.size = unit(3.5, 'mm'), # change the size of each legend box
          legend.position = 'bottom',
          plot.margin = margin(0.25, 3, 0.25, 1.5, unit = 'cm'), # increase margins around the plot
          axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # rotate x-axis labels
    facet_grid(cols = vars(days_postTx), labeller = as_labeller(facet_col_labels))
)
ggsave(here::here('figures', 'composition', 'bact_taxonomy_genera_by_time.pdf'), bact_taxonomy_gen,
       width = 20, height = 15, units = 'cm')

The resulting output file looks like this:

TOP OF PAGE

Taxonomic composition by individual

Seeing as we also know the IDs of the animals, we can also plot the taxonomic composition for each individual over time, facetting by both group and time frame.

We also need to add the treatment names to the end of the facet_col_labels vector here, because we are now facetting by two factors. The labeller just takes a single named vector for all labels.

# Barplot of abundance for each individual by group over time
(bact_taxonomy_gen_rat <- ggplot(bact_data_tax, aes(x = rat_id, y = Abundance, fill = Genus, NArm = FALSE)) +
    geom_bar(stat = 'identity', position = 'fill', size = 0.01) +
    scale_fill_manual(values = col_vector) +
    labs(title = 'Genera Distribution vs Treatment Group',
         x = 'Rat',
         y = 'Abundance') +
    theme(legend.key.size = unit(3.5, 'mm'),
          legend.position = 'bottom',
          plot.margin = margin(0.25, 3, 0.25, 2.25, unit = 'cm'),
          axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    facet_grid(row = vars(days_postTx), cols = vars(group), scales = 'free', labeller = as_labeller(c(facet_col_labels,
                                                                                                      'Sham' = 'Sham',
                                                                                                      'DrugXYZ' = 'DrugXYZ')))
)
ggsave(here('figures', 'composition', 'bact_taxonomy_gen_rat_by_time.pdf'), bact_taxonomy_gen_rat,
       width = 20, height = 20, units = 'cm')

The resulting output file looks like this:

TOP OF PAGE

Ratios between bacterial phyla

In particular the ratio between Firmicutes and Bacteroidetes is a popular measure, often used in assessing the health of the gastrointestinal microbiome. That said, we can look at the ratios of all phyla using the phyla_ratios() function we provide here.

Arguments for `phyla_ratios()`

phyloseq_object: a phyloseq object to use for differential abundance testing. This should be a phyloseq object containing raw (not normalised) counts.
test_variable: the name of a single column from the sample_data to be used for comparison of phyla ratios.
average_reads_threshold (default = 1000): optional - the minimum number of reads per sample on average that a phylum should have in order to be retained for analysis of ratios. Assessed by multiplying this value by the number of samples – if the row sums of a given phylum have less than average_reads_threshold * number of samples, it will be removed.
plot_output_folder: optional - the path to a folder where you would like output plots to be saved. If left blank, no plots will be saved.
plot_file_prefix: optional - a string to attach to the start of the individual file names for your plots. This input is only used if the plot_output_folder argument is also provided.

Function output

phyla_ratios: a data.frame of the input data for each phylum-phylum comparison that passed thresholding, along with the ratio and test_variable.
phyla_ratio_plots: a ggplot2 plot of each phylum-phylum comparison that passed thresholding, with either a Wilcoxon or Kruskal-Wallis statistic depending on the number of groups.
stats_all: a data.frame containing all pair-wise Wilcoxon Rank Sum testing statistics for each comparison.
stats_signif: a filtered version of stats_all containing only those comparisons with a p-value < 0.5.

An example of the output .pdf file is shown here:

TOP OF PAGE

Next steps

Differential abundance analysis

Citation

If you used this repository in a publication, please mention its url.

Rights

Copyright (c) 2023 Mucosal Immunology Lab, Monash University, Melbourne, Australia.
Authors: M. Macowan

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Diversity and composition

Analysis of diversity and composition

Table of contents

Alpha diversity

Example: longitudinal experiment with two groups

Beta diversity

PERMANOVA of beta diversity

Taxonomic composition

Data preparation

Taxonomic barplot by group

Taxonomic composition by individual

Ratios between bacterial phyla

Arguments for `phyla_ratios()`

Function output

Next steps

Citation

Rights

Clone this wiki locally

Diversity and composition

Analysis of diversity and composition

Table of contents

Alpha diversity

Example: longitudinal experiment with two groups

Beta diversity

PERMANOVA of beta diversity

Taxonomic composition

Data preparation

Taxonomic barplot by group

Taxonomic composition by individual

Ratios between bacterial phyla

Arguments for phyla_ratios()

Function output

Next steps

Citation

Rights

Clone this wiki locally

Arguments for `phyla_ratios()`