diff --git a/vignettes/phyloseq-analysis.Rmd b/vignettes/phyloseq-analysis.Rmd index 86c9cdc2..f87384a6 100644 --- a/vignettes/phyloseq-analysis.Rmd +++ b/vignettes/phyloseq-analysis.Rmd @@ -74,7 +74,7 @@ theme_set(theme_bw()) See the [microbio_me_qiime tutorial](http://joey711.github.io/phyloseq/download-microbio.me.html) for more details and examples downloading and importing into phyloseq/R directly from this public database. ## Included Data -To facilitate testing and exploration of tools in phyloseq, this package includes example data from published studies. Many of the examples in this vignette use either the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) or `enterotype` datasets as source data. The [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data was described in a [2011 article in PNAS](http://www.pnas.org/content/early/2010/06/02/1000080107)([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), and compares the microbial communities of 25 environmental samples and three known "mock communities" --- a total of 9 sample types --- at a depth averaging 3.1 million reads per sample. The [human enterotype dataset](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) was described in a [2011 article in Nature](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)), which compares the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. The authors further compare these microbial communities with the faecal communities of subjects from other studies, for a total of 280 faecal samples / subjects, and 553 genera. Sourcing data from different studies invariable leads to gaps in the data for certain variables, and this is easily handled by `R's core `NA features. +To facilitate testing and exploration of tools in phyloseq, this package includes example data from published studies. Many of the examples in this vignette use either the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) or `enterotype` datasets as source data. The [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data was described in a [2011 article in PNAS](http://www.pnas.org/content/early/2010/06/02/1000080107)([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), and compares the microbial communities of 25 environmental samples and three known "mock communities" --- a total of 9 sample types --- at a depth averaging 3.1 million reads per sample. The [human enterotype dataset](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) was described in a [2011 article in Nature](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)), which compares the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. The authors further compare these microbial communities with the faecal communities of subjects from other studies, for a total of 280 faecal samples / subjects, and 553 genera. Sourcing data from different studies invariable leads to gaps in the data for certain variables, and this is easily handled by R's core `NA` features. Because this data is included in the package, the examples can easily be run on your own computer using the code shown in this vignette. The data is loaded into memory using the `data` command. Let's start by loading the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data. @@ -119,7 +119,7 @@ Alpha diversity estimators for samples in the *Global Patterns* dataset. Each pa For further details, see the [plot_tree tutorial](http://joey711.github.io/phyloseq/plot_tree-examples.html) -phyloseq also contains a method for easily plotting an annotated phylogenetic tree with information regarding the sample in which a particular taxa was observed, and optionally the number of individuals that were observed. +phyloseq also contains a method for easily plotting an annotated phylogenetic tree with information regarding the sample in which a particular taxa was observed, and optionally the number of individuals that were observed. For the sake of creating a readable tree, let's subset the data to just the [Chlamydiae](http://en.wikipedia.org/wiki/Chlamydiae) phylum, which consists of obligate intracellular pathogens and is present in only a subset of environments in this dataset. @@ -139,8 +139,8 @@ Phylogenetic tree representation of the Chlamydiae species in the microbiome sam For further details, see the [plot_bar tutorial](http://joey711.github.io/phyloseq/plot_bar-examples.html) -In the following example we use the included "enterotype" dataset ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)). - +In the following example we use the included "enterotype" dataset ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)). + ```{r} data(enterotype) ``` @@ -151,9 +151,9 @@ par(mar = c(10, 4, 4, 2) + 0.1) # make more room on bottom margin N <- 30 barplot(sort(taxa_sums(enterotype), TRUE)[1:N]/nsamples(enterotype), las=2) ``` -An example exploratory barplot using base `R graphics and the `taxa_sums and `nsamples functions. +An example exploratory barplot using base R graphics and the `taxa_sums` and `nsamples` functions. -Note that this first barplot is clipped at the `r N`th OTU. This was chosen because `ntaxa(enterotype) =``r ntaxa(enterotype)` OTUs would not be legible on the plot. As you can see, the relative abundances have decreased dramatically by the 10th-ranked OTU. +Note that this first barplot is clipped at the `r N`th OTU. This was chosen because `ntaxa(enterotype) =``r ntaxa(enterotype)` OTUs would not be legible on the plot. As you can see, the relative abundances have decreased dramatically by the 10th-ranked OTU. So what are these OTUs? In the `enterotype` dataset, only a single taxonomic rank type is present: ```{r} @@ -161,10 +161,10 @@ rank_names(enterotype) ``` This means the OTUs in this dataset have been grouped at the level of genera, and no other taxonomic grouping/transformation is possible without additional information (like might be present in a phylogenetic tree, or with further taxonomic classification analysis). -We need to know which taxonomic rank classifiers, if any, we have available to specify in the second barplot function in this example, `plot_bar(). We have already observed how quickly the abundance decreases with rank, so wo we will subset the enterotype dataset to the most abundant `N taxa in order to make the barplot legible on this page. +We need to know which taxonomic rank classifiers, if any, we have available to specify in the second barplot function in this example, `plot_bar()`. We have already observed how quickly the abundance decreases with rank, so we will subset the `enterotype` dataset to the ten most abundant taxa in order to make the barplot legible on this page. ```{r} -TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10]) +TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10]) ent10 <- prune_taxa(TopNOTUs, enterotype) print(ent10) ``` @@ -297,7 +297,7 @@ Scree plot of the PCoA used to create Figure 5 from the "Global Patterns" articl Next, we will reproduce Figure 5 from the "Global Patterns" article([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), but separating the three axes into 2 plots using `plot_ordination()`. ```{r GPfig5ax1213} -(p12 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", color="SampleType") + +(p12 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", color="SampleType") + geom_point(size=5) + geom_path() + scale_colour_hue(guide = FALSE) ) (p13 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", axes=c(1, 3), color="SampleType") + geom_line() + geom_point(size=5) ) @@ -354,20 +354,20 @@ The correspondence analysis (CA) scree plot of the "Global Patterns" dataset. Now let's investigate how the samples behave on the first few CA axes. ```{r GPCA1234} -(p12 <- plot_ordination(GP, gpca, "samples", color="SampleType") + +(p12 <- plot_ordination(GP, gpca, "samples", color="SampleType") + geom_line() + geom_point(size=5) ) -(p34 <- plot_ordination(GP, gpca, "samples", axes=c(3, 4), color="SampleType") + +(p34 <- plot_ordination(GP, gpca, "samples", axes=c(3, 4), color="SampleType") + geom_line() + geom_point(size=5) ) ``` First 4 axes of Correspondence Analysis (CA) of the "Global Patterns" dataset ([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)). -A clear feature of these plots is that the feces and mock communities cluster tightly together, far away from all other samples on the first axis (CA1). The skin and tongue samples separate similarly, but on the second axis. Taken together, it appears that the first two axes are best explained by the separation of human-associated "environments" from the other non-human environments in the dataset, with a secondary separation of tongue and skin samples from feces. +A clear feature of these plots is that the feces and mock communities cluster tightly together, far away from all other samples on the first axis (CA1). The skin and tongue samples separate similarly, but on the second axis. Taken together, it appears that the first two axes are best explained by the separation of human-associated "environments" from the other non-human environments in the dataset, with a secondary separation of tongue and skin samples from feces. We will now investigate further this top-level structure of the data, using an additional feature of correspondence analysis that allows us to compare the relative contributions of individual taxa on the same graphical space: the "biplot". However, because we just displayed the position of samples in the ordination and there are often many thousands of OTUs, we will focus on creating an interpretable plot of the OTUs. For creating graphics that combine the two plots, try the `"biplot"` or `"split"` option for `type` in `plot_ordination()`. ```{r GPCAspecplot0} p1 <- plot_ordination(GP, gpca, "species", color="Phylum") -(p1 <- ggplot(p1$data, p1$mapping) + geom_point(size=5, alpha=0.5) + +(p1 <- ggplot(p1$data, p1$mapping) + geom_point(size=5, alpha=0.5) + facet_wrap(~Phylum) + scale_colour_hue(guide = FALSE) ) ``` @@ -386,20 +386,20 @@ These figures reveal some useful patterns and interesting outliers, but what if ```{r GPCAjitter0} library("reshape2") # Melt the species-data.frame, DF, to facet each CA axis separately -mdf <- melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")], +mdf <- melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")], id=c("Phylum", "Family", "Genus") ) # Select some special outliers for labelling LF <- subset(mdf, variable=="CA2" & value < -1.0) # build plot: boxplot summaries of each CA-axis, with labels -p <- ggplot(mdf, aes(Phylum, value, color=Phylum)) + - geom_boxplot() + - facet_wrap(~variable, 2) + +p <- ggplot(mdf, aes(Phylum, value, color=Phylum)) + + geom_boxplot() + + facet_wrap(~variable, 2) + scale_colour_hue(guide = FALSE) + - theme_bw() + + theme_bw() + theme( axis.text.x = element_text(angle = -90, vjust = 0.5) ) # Add the text label layer, and render ggplot graphic (p <- p + geom_text(data=subset(LF, !is.na(Family)), - mapping = aes(Phylum, value+0.1, color=Phylum, label=Family), + mapping = aes(Phylum, value+0.1, color=Phylum, label=Family), vjust=0, size=2)) ``` @@ -417,20 +417,20 @@ In this figure we've used the `threshold` parameter to omit all but phyla accoun ### Double Principle Coordinate Analysis (DPCoA) -Here is a quick example illustrating the use of Double Principal Coordinate Analysis (DPCoA~\cite{Pavoine2004523), using the using the `ordinate()` function in phyloseq, as well as the "biplot" option for `plot_ordination(). For a description that includes an applied example using the "enterotype" dataset and comparison with UniFrac/PCoA, see Fukuyama et al~\cite{fukuyama2012com. +Here is a quick example illustrating the use of Double Principal Coordinate Analysis (DPCoA~\cite{Pavoine2004523), using the using the `ordinate()` function in phyloseq, as well as the "biplot" option for `plot_ordination(). For a description that includes an applied example using the "enterotype" dataset and comparison with UniFrac/PCoA, see Fukuyama et al~\cite{fukuyama2012com. ```{r GPdpcoa01} # Perform ordination -GP.dpcoa <- ordinate(GP, "DPCoA") +GP.dpcoa <- ordinate(GP, "DPCoA") # Generate default ordination bi-plot -pdpcoa <- +pdpcoa <- plot_ordination( - physeq = GP, - ordination = GP.dpcoa, + physeq = GP, + ordination = GP.dpcoa, type="biplot", - color="SampleType", + color="SampleType", shape="Phylum") -# Adjust the shape scale manually +# Adjust the shape scale manually # to make taxa hollow and samples filled (advanced) shape.fac <- pdpcoa$data$Phylum man.shapes <- c(19, 21:25) @@ -456,7 +456,7 @@ plot_ordination(GP, GP.dpcoa, type="split", ### distance(): Central Distance Function -Many comparisons of microbiome samples, including the graphical model and the PCoA analysis, require a calculation for the relative dissimilarity/distance between one microbial community and another. The phyloseq-package provides a general "wrapper" function for calculating ecological distance matrices between the samples in an experiment. +Many comparisons of microbiome samples, including the graphical model and the PCoA analysis, require a calculation for the relative dissimilarity/distance between one microbial community and another. The phyloseq-package provides a general "wrapper" function for calculating ecological distance matrices between the samples in an experiment. `distance()` currently supports 43 method options, as well as user-provided arbitrary methods via an interface to vegan's `designdist()` function. Currrently only sample-wise distances are supported (the `type` argument), but eventually species-wise (OTU-wise) distances will be supported as well. In addition to supporting any of the method options to the three main distance functions of the vegan-package~\cite{veganpkg --- including the 14 distances of the `vegdist()` function and all 24 ecological distances reviewed in Koleff et al. 2003~\cite{Koleff2003JAE coded by `betadiver` --- `distance()` will eventually support many of the distance methods in the ade4-package~\cite{Dray:2007vs, and others. It also supports the Fast UniFrac distance function (Section~\ref{sec:unifrac) included in phyloseq as native R code, and a wrapper for retreiving the sample-distances from Double Principal Coordinate Analysis (DPCoA). @@ -464,7 +464,7 @@ The function takes a `phyloseq-class` object and an argument indicating the dist ```{r distancefun} data(esophagus) -distance(esophagus, "bray") +distance(esophagus, "bray") distance(esophagus, "wunifrac") # weighted UniFrac distance(esophagus, "jaccard") # vegdist jaccard distance(esophagus, "g") # betadiver method option "g" @@ -472,7 +472,7 @@ distance(esophagus, "g") # betadiver method option "g" ### UniFrac and weighted UniFrac -UniFrac is a recently-defined~\cite{Lozupone:2005gn and popular distance metric to summarize the difference between pairs of ecological communities. All UniFrac variants use a phylogenetic tree of the relationship among taxa as central information to calculating the distance between two samples/communities. An unweighted UniFrac distance matrix only considers the presence/absence of taxa, while weighted UniFrac accounts for the relative abundance of taxa as well as their phylogenetic distance. Prior to phyloseq, a non-parallelized, non-Fast implementation of the unweighted UniFrac was available in \R{ packages (`picante::unifrac`~\cite{Kembel:2010ft). In the phyloseq package we provide optionally-parallelized implementations of Fast UniFrac~\cite{Hamady:2009fk (both weighted and unweighted, with plans for additional UniFrac variants), all of which return a sample-wise distance matrix from any `phyloseq-class object that contains a phylogenetic tree component. +UniFrac is a recently-defined~\cite{Lozupone:2005gn and popular distance metric to summarize the difference between pairs of ecological communities. All UniFrac variants use a phylogenetic tree of the relationship among taxa as central information to calculating the distance between two samples/communities. An unweighted UniFrac distance matrix only considers the presence/absence of taxa, while weighted UniFrac accounts for the relative abundance of taxa as well as their phylogenetic distance. Prior to phyloseq, a non-parallelized, non-Fast implementation of the unweighted UniFrac was available in \R{ packages (`picante::unifrac`~\cite{Kembel:2010ft). In the phyloseq package we provide optionally-parallelized implementations of Fast UniFrac~\cite{Hamady:2009fk (both weighted and unweighted, with plans for additional UniFrac variants), all of which return a sample-wise distance matrix from any `phyloseq-class` object that contains a phylogenetic tree component. The following is an example calculating the UniFrac distance (both weighted and unweighted) matrix using the "esophagus" example dataset: @@ -494,7 +494,7 @@ data(GlobalPatterns) load(system.file("doc", "Unweighted_UniFrac.RData", package="phyloseq")) # Manually define color-shading vector based on sample type. colorScale <- rainbow(length(levels(get_variable(GlobalPatterns, "SampleType")))) -cols <- colorScale[get_variable(GlobalPatterns, "SampleType")] +cols <- colorScale[get_variable(GlobalPatterns, "SampleType")] GP.tip.labels <- as(get_variable(GlobalPatterns, "SampleType"), "character") # This is the actual hierarchical clustering call, specifying average-link clustering GP.hclust <- hclust(GPUF, method="average") @@ -510,7 +510,7 @@ One of our recommended approaches to this problem was described in McMurdie and Holmes (2014) [Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible](http://dx.plos.org/10.1371/journal.pcbi.1003531). PLoS Computational Biology. 10(4):e1003531 -Some reproducible demonstrations of this approach are included in +Some reproducible demonstrations of this approach are included in [the phyloseq extensions repository](http://joey711.github.io/phyloseq-extensions/extensions-index.html), the `phyloseq_to_deseq2` function, as well as a separate vignetted dedicated to this topic (phyloseq and DESeq2 on Colorectal Cancer Data).