From ac72c4ad43299136bb2341f3fe1a68240c54b7d7 Mon Sep 17 00:00:00 2001 From: AshKernow Date: Mon, 30 Sep 2024 16:28:56 +0100 Subject: [PATCH] Minor changes to load material and change venn code --- Markdowns/10_Data_Visualisation.Rmd | 432 ++++++++++++++++++ Markdowns/10_Data_Visualisation_solutions.Rmd | 105 +++++ 2 files changed, 537 insertions(+) create mode 100644 Markdowns/10_Data_Visualisation.Rmd create mode 100644 Markdowns/10_Data_Visualisation_solutions.Rmd diff --git a/Markdowns/10_Data_Visualisation.Rmd b/Markdowns/10_Data_Visualisation.Rmd new file mode 100644 index 0000000..41c59b3 --- /dev/null +++ b/Markdowns/10_Data_Visualisation.Rmd @@ -0,0 +1,432 @@ +--- +title: "Introduction to Bulk RNAseq data analysis" +subtitle: Visualisation of Differential Expression Results +date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' +output: + html_document: + toc: yes + toc_float: true + pdf_document: + toc: yes +bibliography: ref.bib +--- + +```{r setup, echo=FALSE} +knitr::opts_chunk$set(echo = TRUE, fig.width = 4, fig.height = 3) +``` + +# Visualisation + +In this section we will consider various ways in which we can visualise the +results of our differential expression analysis. We will use the `DESeq2` +results from the interaction model for Infected vs Uninfected at day 11. + +```{r packages, message=FALSE, warning=FALSE} +library(DESeq2) +library(tidyverse) + +results.d11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds") +``` + + +## P-value histogram + +A quick and easy "sanity check" for our DE results is to generate a p-value +histogram. What we should see is a high bar at `0 - 0.05` and then a roughly +uniform tail to the right of this. There is a nice explanation of other possible +patterns in the histogram and what to do when you see them in [this +post](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/). 
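As a rough illustration of the expected shape described above, the sketch below simulates p-values rather than using the course data. The 90/10 split between unchanged and truly changed genes and the Beta(0.5, 10) distribution are assumptions chosen only to make the pattern visible, and the names `null_p`, `alt_p` and `pvals` are hypothetical.

```r
# Simulated p-values: an illustration only, not the course data
set.seed(42)

# Genes with no true difference give uniformly distributed p-values
null_p <- runif(9000)

# Genes with a real difference pile up near zero
# (Beta(0.5, 10) is an arbitrary choice that gives this shape)
alt_p <- rbeta(1000, shape1 = 0.5, shape2 = 10)

pvals <- c(null_p, alt_p)

# With breaks of width 0.05 the first bar covers 0 - 0.05
hist(pvals, breaks = seq(0, 1, by = 0.05),
     main = "Simulated p-values", xlab = "p-value")
```

The uniform component produces the flat tail and the second component produces the tall first bar, so a histogram of real results that departs strongly from this shape is worth investigating before going further.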
+ +```{r pvalHist, fig.align="center"} +hist(results.d11$pvalue) +``` + +## Shrinking the log2FoldChange + +`DESeq2` provides a function called `lfcShrink` that shrinks log-Fold Change +(LFC) estimates towards zero using an empirical Bayes procedure. The reason for +doing this is that there is high variance in the LFC estimates when counts are +low and this results in lowly expressed genes appearing to show greater +differences between groups than highly expressed genes. The `lfcShrink` method +compensates for this and allows better visualisation and ranking of genes. There +are a few different shrinkage methods available; we will use the `ashr` method. + +To run the shrinkage algorithm we will also need the `DESeq2 Dataset` object +that was used to generate the results. We will also want to add gene symbols +to the shrunk results for easier interpretation. + +```{r shrinkLFC} +ddsObj <- readRDS("RObjects/DESeqDataSet.interaction.rds") +annot <- readRDS("RObjects/Ensembl_annotations.rds") + +shrink.11 <- lfcShrink(ddsObj, + res = results.d11, + type = "ashr") + +shrinkTab.11 <- as.data.frame(shrink.11) %>% + rownames_to_column("GeneID") %>% + left_join(annot, "GeneID") +``` + +Note that it is important to provide both the `DESeq2` object and the `DESeq2 +results` object to the `lfcShrink` function. Strictly speaking the `DESeq2 +results` object is optional, as we could instead just provide a contrast to +shrink. However, if we do this the function will recalculate the p-values and +padj, this time with the alpha for independent filtering set to the default of +0.1, and so the adjusted p-values will be different. + +Let's save the shrunk results to a file so we can use them later. + +```{r saveShrunkResults, eval=FALSE} +saveRDS(shrinkTab.11, file="results/Shrunk_Results.d11.rds") +``` + +## MA plots + +MA plots are a common way to visualize the results of a differential analysis. 
+We met them briefly towards the end of [the DESeq2 +session](08_DE_analysis_with_DESeq2.html). This plot shows the log-Fold Change +for each gene against its average expression across all samples in the two +conditions being contrasted. `DESeq2` has a handy function for plotting this. +Let's use it to compare the shrunk and un-shrunk fold changes. + +```{r maPlotDESeq2, fig.align="center", fig.width=10, fig.height=5} +par(mfrow=c(1,2)) +plotMA(results.d11, alpha=0.05) +plotMA(shrink.11, alpha=0.05) +``` + +The `plotMA` function in DESeq2 is fine for a quick look, but inbuilt functions +like this aren't easy to customise, whether to change the way the plot looks or +to add things such as gene labels. For this we would recommend using the +`ggplot2` package. + +## Volcano Plots + +Another common visualisation is the +[*volcano plot*](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)) which +displays a measure of significance on the y-axis and fold-change on the x-axis. +We will use ggplot to create this. + +### A Brief Introduction to `ggplot2` + +The [`ggplot2`](http://ggplot2.tidyverse.org/) package has emerged as an +attractive alternative to the traditional plots provided by base R. A full +overview of all capabilities of the package is available from the +[cheatsheet](https://rstudio.github.io/cheatsheets/data-visualization.pdf). 
+ +In brief:- + +- `shrinkTab.11` is our data frame containing the variables we wish to plot +- `aes` creates a mapping between the variables in our data frame to the +**_aes_**thetic properties of the plot: + + the x-axis will be mapped to `log2FoldChange` + + the y-axis will be mapped to the `-log10(pvalue)` +- `geom_point` specifies the particular type of plot we want (in this case a +scatter plot) +- `geom_text` allows us to add labels to some or all of the points + + see + [the cheatsheet](https://rstudio.github.io/cheatsheets/data-visualization.pdf) + for other plot types + +The real advantage of `ggplot2` is the ability to change the appearance of our +plot by mapping other variables to aspects of the plot. For example, we could +colour the points based on the sample group. To do this we can add metadata from +the `sampleinfo` table to the data. The colours are automatically chosen by +`ggplot2`, but we can specify particular values. For the volcano plot we will +colour according to whether the gene has an adjusted p-value (`padj`) below 0.05. +We use a `-log10` transformation for the y-axis; it's commonly used for p-values +as it means that more significant genes have a higher value on the scale. + +```{r volcano11Plot, fig.align="center", fig.width=5, fig.height=5} +ggplot(shrinkTab.11, aes(x = log2FoldChange, y = -log10(pvalue))) + + geom_point(aes(colour = padj < 0.05), size=1) + + geom_text(data = ~top_n(.x, 1, wt = -padj), aes(label = Symbol)) + + labs(x = "log2(fold change)", y = "-log10(p-value)", colour = "FDR < 5%", + title = "Infected vs Uninfected (day 11)") +``` + +## Exercise 1 - Volcano plot for 33 days + +> We just made the volcano plot for the 11 days contrast; now you will make the +> one for the 33 days contrast. + +> First load in the results for the 33 days contrast. + +```{r} +results.d33 <- readRDS("RObjects/DESeqResults.interaction_d33.rds") +``` + +> (a) +> Shrink the results for the 33 days contrast and add the annotation. 
+ +```{r, echo=FALSE} +shrink.33 <- lfcShrink(ddsObj, + res = results.d33, + type = "ashr") + +shrinkTab.33 <- as.data.frame(shrink.33) %>% + rownames_to_column("GeneID") %>% + left_join(annot, "GeneID") +``` + +> (b) +> Create a plot with points coloured by padj < 0.05 similar to how we did in +> the first volcano plot. + +```{r echo=FALSE, eval=FALSE} +ggplot(shrinkTab.33, aes(x = log2FoldChange, y = -log10(pvalue))) + + geom_point(aes(colour = padj < 0.05), size=1) + + labs(x = "log2(fold change)", y = "-log10(p-value)", colour = "FDR < 5%", + title = "Infected vs Uninfected (day 33)") +``` + +> (c) +> Compare these two volcano plots. What differences can you see between the two contrasts? + + +## Exercise 2 - MA plot for day 33 with ggplot2 + +> For this exercise create an MA plot for day 33 like the ones we plotted with +> `plotMA` from **DESeq2** but this time using ggplot2. +> +> The x-axis should be the log2 of the mean gene expression across all +> samples, and the y-axis should be the log2 of the fold change between Infected +> and Uninfected. + +## Strip Charts for gene expression + +Before following up on the DE genes with further lab work, a recommended *sanity +check* is to have a look at the expression levels of the individual samples for +the genes of interest. We can quickly look at grouped expression by using the +`plotCounts` function of `DESeq2` to retrieve the normalised expression values +from the `ddsObj` object and then plotting with `ggplot2`. 
+ +We are going to investigate the Il10ra gene: + + +```{r plotGeneCounts} +geneID <- filter(shrinkTab.11, Symbol=="Il10ra") %>% pull(GeneID) + +plotCounts(ddsObj, + gene = geneID, + intgroup = c("TimePoint", "Status", "Replicate"), + returnData = TRUE) %>% + ggplot(aes(x=Status, y=log2(count))) + + geom_point(aes(fill=Replicate), shape=21, size=2) + + facet_wrap(~TimePoint) + + expand_limits(y=0) + + labs(title = "Normalised counts - Interleukin 10 receptor, alpha") +``` + +## Exercise 3 + +> For this exercise create another strip chart for the gene Jchain. + + +## Venn Diagram + +In the paper you may notice they have presented a Venn diagram of the results. + +![](../images/Venn.png) + +We will recreate it with our analysis. To do this we are using the package +`ggvenn` which is an extension to `ggplot2` from Linlin Yan. + +```{r} +library(ggvenn) +``` + +We want to plot four "sets" on the Venn diagram: + +* Significantly up-regulated on day 11 +* Significantly down-regulated on day 11 +* Significantly up-regulated on day 33 +* Significantly down-regulated on day 33 + +Each set comprises the genes that are statistically significant at a 5% FDR level for the +respective contrast. + +There are two ways of providing the data to `ggvenn`. The first is to provide a +table with features (genes) in the rows and the sets (contrasts) in the columns, and +`TRUE` or `FALSE` in the cells to indicate whether the feature is in that set. 
+For our data the table would look like this: + +```{r echo=FALSE} +tibble(Geneid=rownames(results.d11)) %>% + mutate(Upregulated_11 = results.d11$padj < 0.05 & + !is.na(results.d11$padj) & + results.d11$log2FoldChange > 0) %>% + mutate(Downregulated_11 = results.d11$padj < 0.05 & + !is.na(results.d11$padj) & + results.d11$log2FoldChange < 0) %>% + mutate(Upregulated_33 = results.d33$padj < 0.05 & + !is.na(results.d33$padj) & + results.d33$log2FoldChange > 0) %>% + mutate(Downregulated_33 = results.d33$padj < 0.05 & + !is.na(results.d33$padj) & + results.d33$log2FoldChange < 0) +``` + +The second is to provide a list with one element for each set. Each element is then +a vector of the features in that set. For our data this would look like this: + +```{r echo=FALSE} +getGenes <- function(shrTab, direction = "up") { + sign <- ifelse(direction == "up", 1, -1) + shrTab %>% + filter(padj < 0.05) %>% + filter(sign * log2FoldChange > 0) %>% + pull("GeneID") +} + +myList <- list(Upregulated_11 = getGenes(shrinkTab.11, "up"), + Downregulated_11 = getGenes(shrinkTab.11, "down"), + Upregulated_33 = getGenes(shrinkTab.33, "up"), + Downregulated_33 = getGenes(shrinkTab.33, "down")) +str(myList) +``` + +We will use the list option as the code for building the list is more concise. + +The code for building each list is basically the same with a couple of minor changes, so +rather than repeating the code we can create a function to do this for us. + +To build up the function, first, let's see how we would do this for the up-regulated +genes on day 11. + +```{r geneListUpRegd11} +shrinkTab.11 %>% + filter(padj < 0.05) %>% + filter(log2FoldChange > 0) %>% + pull("GeneID") +``` + +The function is just a generalisation of this code. We want to be able to do the same +using different tables (day 11 and day 33), and we also need to be able to get the +up- or down-regulated genes. We can do this by passing the table and the direction +as arguments to the function. 
To change the direction of the regulation, we can +leave the boolean filter as `log2FoldChange > 0` and multiply the `log2FoldChange` +by 1 or -1 depending on the direction we want. + +```{r} +getGenes <- function(shrTab, direction = "up") { + sign <- ifelse(direction == "up", 1, -1) + shrTab %>% + filter(padj < 0.05) %>% + filter(sign * log2FoldChange > 0) %>% + pull("GeneID") +} + +vennList <- list(Upregulated_11 = getGenes(shrinkTab.11, "up"), + Downregulated_11 = getGenes(shrinkTab.11, "down"), + Upregulated_33 = getGenes(shrinkTab.33, "up"), + Downregulated_33 = getGenes(shrinkTab.33, "down")) + +str(vennList) +``` + +Now we just pass the list to the `ggvenn` function. + +```{r vennPlot} +ggvenn(vennList, set_name_size = 4) +``` + +## Heatmap + +We're going to use the package `ComplexHeatmap` [@Gu2016]. We'll also use +`circlize` to generate a colour scale [@Gu2014]. + +```{r complexHeatmap, message=F} +library(ComplexHeatmap) +library(circlize) +``` + +We can't plot the entire data set, so let's just select the top 300 genes by false discovery rate (`padj`). We'll +want to use normalised expression values, so we'll use the `vst` function. + +```{r selectGenes} +# get the top genes +sigGenes <- shrinkTab.11 %>% + top_n(300, wt = -padj) %>% + pull("GeneID") + +# filter the data for the top 300 by padj +plotDat <- vst(ddsObj)[sigGenes,] %>% + assay() +``` + +The range of expression values for different genes can vary widely. Some genes will +have very high expression. Our heatmap is going to be coloured according to gene +expression. If we used a colour scale from 0 (no expression) to the maximum +expression, the scale would be dominated by our most extreme genes and it would be +difficult to discern any difference between most of the genes. + +To overcome this we will z-scale the counts. 
This scaling method results in +values for each gene that show the number of standard deviations the gene expression +is from the mean for that gene across all the samples - the mean will be '0', '1' +means 1 standard deviation higher than the mean, '-1' means 1 standard deviation +lower than the mean. + +```{r z-scale} +z.mat <- t(scale(t(plotDat), center=TRUE, scale=TRUE)) +``` + +```{r colourScale} +# colour palette +myPalette <- c("royalblue3", "ivory", "orangered3") +myRamp <- colorRamp2(c(-2, 0, 2), myPalette) +``` + +```{r heatmap, fig.width=5, fig.height=8} +Heatmap(z.mat, name = "z-score", + col = myRamp, + show_row_names = FALSE) +``` + +We can also split the heat map into clusters and add some annotation. + +```{r splitHeatmap, fig.width=5, fig.height=8} +ha1 = HeatmapAnnotation(df = colData(ddsObj)[,c("Status", "TimePoint")]) + +Heatmap(z.mat, name = "z-score", + col = myRamp, + show_row_name = FALSE, + split=3, + rect_gp = gpar(col = "lightgrey", lwd=0.3), + top_annotation = ha1) +``` + +Whenever we teach this session several students always ask how to set the +colours of the bars at the top of the heatmap. This is shown below. 
+ +```{r ColouredsplitHeatmap, fig.width=5, fig.height=8} +ha1 = HeatmapAnnotation(df = colData(ddsObj)[,c("Status", "TimePoint")], + col = list(Status = c("Uninfected" = "darkgreen", + "Infected" = "palegreen"), + TimePoint = c("d11" = "lightblue", + "d33" = "darkblue"))) + +Heatmap(z.mat, name = "z-score", + col = myRamp, + show_row_name = FALSE, + split=3, + rect_gp = gpar(col = "lightgrey", lwd=0.3), + top_annotation = ha1) +``` + + +```{r saveEnvironment, eval=FALSE} +saveRDS(results.d11, file="results/Annotated_Results.d11.rds") +saveRDS(shrinkTab.11, file="results/Shrunk_Results.d11.rds") +saveRDS(results.d33, file="results/Annotated_Results.d33.rds") +saveRDS(shrinkTab.33, file="results/Shrunk_Results.d33.rds") +``` + +```{r saveObjects, eval=TRUE, echo=FALSE} +# pre-processed course files +saveRDS(results.d11, file="RObjects/Annotated_Results.d11.rds") +saveRDS(shrinkTab.11, file="RObjects/Shrunk_Results.d11.rds") +# saveRDS(results.d33, file="RObjects/Annotated_Results.d33.rds") +saveRDS(shrinkTab.33, file="RObjects/Shrunk_Results.d33.rds") +``` diff --git a/Markdowns/10_Data_Visualisation_solutions.Rmd b/Markdowns/10_Data_Visualisation_solutions.Rmd new file mode 100644 index 0000000..8213071 --- /dev/null +++ b/Markdowns/10_Data_Visualisation_solutions.Rmd @@ -0,0 +1,105 @@ +--- +title: "Introduction to Bulk RNAseq data analysis" +author: "Abbi Edwards" +date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' +output: + html_document: default + pdf_document: default +subtitle: Annotation and Visualisation of Differential Expression Results - Solutions +--- + +```{r setup, echo=FALSE, cache=FALSE} +knitr::opts_chunk$set(echo = TRUE, fig.width = 4, fig.height = 3) +knitr::opts_knit$set(root.dir = here::here("Course_Materials")) +``` + +```{r packages, include=FALSE} +library(AnnotationHub) +library(AnnotationDbi) +library(ensembldb) +library(DESeq2) +library(tidyverse) +``` + +```{r prepareData, echo=FALSE, message=FALSE, warning=FALSE} +# First load 
data and annotations +ddsObj.interaction <- readRDS("RObjects/DESeqDataSet.interaction.rds") +results.interaction.11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds") +results.interaction.33 <- readRDS("RObjects/DESeqResults.interaction_d33.rds") +``` + + +## Exercise 1 - Volcano plot for 33 days + +Now it's your turn! We just made the volcano plot for the 11 days contrast; now you will make the one for the 33 days contrast. + +If you haven't already, make sure you load in our data and annotation. You can copy and paste the code below. + +```{r load} +# First load data and annotations +results.interaction.33 <- readRDS("RObjects/DESeqResults.interaction_d33.rds") +ensemblAnnot <- readRDS("RObjects/Ensembl_annotations.rds") +``` + +> (a) +> Shrink the results for the 33 days contrast. + +```{r shrink} +#Shrink our values +ddsShrink.33 <- lfcShrink(ddsObj.interaction, + res = results.interaction.33, + type = "ashr") + +shrinkTab.33 <- as.data.frame(ddsShrink.33) %>% + rownames_to_column("GeneID") %>% + left_join(ensemblAnnot, "GeneID") +``` + +> (b) +> Create a plot with points coloured by padj < 0.05 similar to how we did in +> the first volcano plot + +```{r plotVol} +ggplot(shrinkTab.33, aes(x = log2FoldChange, y = -log10(pvalue))) + + geom_point(aes(colour = padj < 0.05), size = 1) + + labs(x = "log2(Fold Change)", y = "-log10(p-value)", colour = "FDR < 5%", + title = "Infected vs Uninfected (day 33)") +``` + + +## Exercise 2 - MA plot for day 33 with ggplot2 + +> For this exercise create an MA plot for day 33 like the ones we plotted with +> `plotMA` from **DESeq2** but this time using ggplot2. +> +> The x-axis (A) should be the log2 of the mean gene expression across all +> samples, and the y-axis should be the log2 of the fold change between Infected +> and Uninfected. 
+ +```{r plotMA} +ggplot(shrinkTab.33, aes(x = log2(baseMean), y = log2FoldChange)) + + geom_point(aes(colour = padj < 0.05), size = 1) + + scale_y_continuous(limits = c(-4, 4), oob = scales::squish) + + labs(x = "log2(Mean Expression)", y = "log2(Fold Change)", colour = "FDR < 5%", + title = "Infected vs Uninfected (day 33)") +``` + +## Exercise 3 - Strip Chart + +> For this exercise create another strip chart for the gene Jchain. + +```{r Exercise3} +geneID_33 <- shrinkTab.33 %>% + filter(Symbol == "Jchain") %>% + pull(GeneID) + +plotCounts(ddsObj.interaction, + gene = geneID_33, + intgroup = c("TimePoint", "Status", "Replicate"), + returnData = TRUE) %>% + ggplot(aes(x = Status, y = log2(count))) + + geom_point(aes(fill = Replicate), shape = 21, size = 2) + + facet_wrap(~ TimePoint) + + expand_limits(y = 0) + + labs(title = "Normalised counts - Immunoglobulin Joining Chain") +``` \ No newline at end of file