From 987fba1aa387a7dde5f78353e2008e636ba81d23 Mon Sep 17 00:00:00 2001
From: Malachi Griffith <malachig@gmail.com>
Date: Fri, 26 Apr 2024 10:49:45 -0500
Subject: [PATCH] flesh out the DEseq2 viz section a bit more

---
 _posts/0003-04-02-DE_Visualization-DESeq2.md | 92 ++++++++++++++++----
 1 file changed, 77 insertions(+), 15 deletions(-)

diff --git a/_posts/0003-04-02-DE_Visualization-DESeq2.md b/_posts/0003-04-02-DE_Visualization-DESeq2.md
index 30d4baf..f3f2ee0 100644
--- a/_posts/0003-04-02-DE_Visualization-DESeq2.md
+++ b/_posts/0003-04-02-DE_Visualization-DESeq2.md
@@ -17,14 +17,17 @@ date: 0003-04-02
 
 
 ### Differential Expression Visualzation
-In this section we will be going over some basic visualizations for the DESeq2 results generated in the "Differential Expression with DESeq2" section of this course. Our goal is to quickly obtain some interpretable results using built-in visualization functions from DESeq2 or recommended packages. For a more in-depth overview of DESeq2 and the results students should view the DESeq2 vignette.
+In this section we will be going over some basic visualizations of the DESeq2 results generated in the "Differential Expression with DESeq2" section of this course. Our goal is to quickly obtain some interpretable results using built-in visualization functions from DESeq2 or recommended packages. For a very extensive overview of DESeq2 and how to visualize and interpret the results, refer to the DESeq2 vignette.
 
 #### Setup
 If it is not already in your R environment, load the DESeqDataSet object and the results table into the R environment.
 
 ```R
 # set the working directory
-setwd('/clount/project')
+setwd('/cloud/project/outdir')
+
+# view the contents of this directory
+dir()
 
 # load libs
 library(DESeq2)
@@ -32,14 +35,22 @@ library(data.table)
 library(pheatmap)
 
 # Load in the DESeqDataSet object
-dds <- readRDS('outdir/dds.rds')
+dds <- readRDS('dds.rds')
+
+# Load in the results objects before and after shrinkage
+res <- readRDS('res.rds')
+resLFC <- readRDS('resLFC.rds')
 
-# Load in the results file
-deGeneResultSorted <- fread('outdir/deGeneResultSorted')
+# Load in the final results file with all sorted DE results
+deGeneResultSorted <- fread('DE_all_genes_DESeq2.tsv')
 ```
 
 #### MA-plot before LFC shrinkage
-MA-plots were originally used in microarray data where M is the the log ratio and A is the mean average of counts. These types of plots are still usefull in RNAseq DE experiments with two conditions, as they can immediately give us information on the number of signficantly differentially expressed genes, the ratio of up vs down regulated genes, and any outliers. To interpret these plots it's important to keep a couple of things in mind. The Y axis (M) is the log2 fold change between the two conditions tested for, a higher fold-change indicates more variability between condition A and condition B. The X axis (A) is a measure of hits on a gene, so as you go higher on on the X axis you are looking at regions which have higher totals of aligned reads, in other words the gene is "more" expressed overall. Using the build in `plotMA` function from DESeq2 we also see that the genes are color coded by a significance threshold. Genes with higher expression values and higher fold-changes are more often significant as one would expect.
+MA-plots were originally used to evaluate microarray expression data where M is the log ratio and A is the mean average (both based on scanned intensity measurements from the microarray). 
+
+These types of plots are still useful in RNAseq DE experiments with two conditions, as they can immediately give us information on the number of significantly differentially expressed genes, the ratio of up vs down regulated genes, and any outliers. To interpret these plots it is important to keep a couple of things in mind. The Y axis (M) is the log2 fold change between the two conditions tested; a higher fold-change indicates a greater difference between condition A and condition B. The X axis (A) is a measure of read alignment to a gene, so as you go higher on the X axis you are looking at genes which have higher totals of aligned reads. In other words, the gene is "more" expressed overall (with the caveat that gene length is not being taken into account by raw read counts here).
+
+Using the built-in `plotMA` function from DESeq2 we also see that the genes are color coded by a significance threshold. Genes with higher expression values and higher fold-changes are more often significant, as one would expect.
 
 ```R
 # use DESeq2 built in MA-plot function
@@ -48,39 +59,90 @@ plotMA(res, ylim=c(-2, 2))
 ```
 
 #### MA-plot after LFC shrinkage
-When we ran DESeq2 we obtained two results, one with and without log-fold change shrinkage. When you have genes with low hits you can get some very large fold changes. For example 1 hit on a gene vs 6 hits on a gene is a 6x fold change. This high level of variance though is probably quantifying noise instead of real biology. Running `plotMA` on our results where we applied an algorithm for log fold change shrinkage we can see that this "effect" is somewhat controlled for.
+When we ran DESeq2 we obtained two sets of results, one with and one without log-fold change shrinkage. When you have genes with low hits you can get some very large fold changes. For example, 1 hit on a gene vs 6 hits on a gene is a 6-fold change. Such a large apparent change, though, is probably quantifying noise rather than real biology. Running `plotMA` on the results where we applied an algorithm for log fold change shrinkage, we can see that this effect is somewhat controlled for.
 
 ```R
 # ma plot
 plotMA(resLFC, ylim=c(-2,2))
 ```
 
+The effect is very subtle here due to the focused nature of our dataset (chr22 genes only), but if you toggle between the two plots and look in the upper left and bottom left corners you can see some fold change values moving closer to 0.
+
 #### Viewing individual gene counts between two conditions
-Often it may be usefull to view the normalized counts for a gene amongst our samples. DESeq2 provides a built in function for that which works off of the dds object. Here we view SEPT3 which we can see in our DE output is significantly higher in the UHR cohort. This is usefull as we can see the per-sample distribtuion of our corrected counts, we can immediately determine if there are any outliers within each group and investigate further if need be.
+Often it may be useful to view the normalized counts for a gene amongst our samples. DESeq2 provides a built-in function for that which works off of the dds object. Here we view SEPT3, which we can see in our DE output is significantly higher in the UHR cohort. This is useful because we can see the per-sample distribution of our corrected counts, immediately determine if there are any outliers within each group, and investigate further if need be.
 
 ```R
 # view SEPT3 normalized counts
-plotCounts(dds, gene='ENSG00000100167', intgroup = 'Disease')
+plotCounts(dds, gene='ENSG00000100167', intgroup = 'Condition')
 
 # view PRAME normalized counts
-plotCounts(dds, gene='ENSG00000185686', intgroup = 'Disease')
+plotCounts(dds, gene='ENSG00000185686', intgroup = 'Condition')
 ```
 
 # Viewing pairwise sample clustering
-It may often be usefull to view inter-sample relatedness, how similar are disimilar are samples are in relation to one another. While not part of the DESeq2 package, there is a convenient library where we can easily construct a hierarchically clustered heatmap from our DESeq2 data. It should be noted that when doing any sort of distance calculation the count data should be transformed using `vst()` or `rlog()`, this can be done directly on the dds object.
+It may often be useful to view inter-sample relatedness, in other words, how similar or dissimilar samples are to one another overall. While not part of the DESeq2 package, there is a convenient library that can easily construct a hierarchically clustered heatmap from our DESeq2 data. It should be noted that when doing any sort of distance calculation the count data should be transformed using `vst()` or `rlog()`; this can be done directly on the dds object.
 
 ```R
 # note that we use rlog because we don't have a large number of genes, for a typical DE experiment with 1000's of genes use the vst() function
 rld <- rlog(dds, blind=F)
 
-# compute sample distances
+# view the structure of this object
+rld
+
+# compute sample distances (the dist function uses the euclidean distance metric by default)
+# in this command we will pull the rlog transformed data ("regularized" log2 transformed, see ?rlog for details) using "assay"
+# then we transpose that data using t()
+# then we calculate distance values using dist() 
+# the distance is calculated for each vector of sample gene values, in a pairwise fashion comparing all samples
+
+# view the first few lines of raw data
+head(assay(dds))
+
+# see the rlog transformed data
+head(assay(rld))
+
+# see the impact of transposing the matrix
+t(assay(rld))[1:6,1:5]
+
+# see the distance values
+dist(t(assay(rld)))
+
+# put it all together and store the result
 sampleDists <- dist(t(assay(rld)))
+
+# convert the distance result to a matrix
 sampleDistMatrix <- as.matrix(sampleDists)
 
+# view the distance numbers directly in the pairwise distance matrix
+head(sampleDistMatrix)
+
 # construct clustered heatmap, important to use the computed sample distances for clustering
-pheatmap(sampleDistMatrix,
-         clustering_distance_rows=sampleDists,
-         clustering_distance_cols=sampleDists)
+pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists)
+```
+
+Instead of a distance metric we could also use a similarity metric such as a Pearson correlation.
+
+There are many correlation and distance options:
+
+- Correlation: "pearson", "kendall", "spearman"
+- Distance: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"
+
+```R
+sampleCorrs <- cor(assay(rld), method="pearson")
+sampleCorrMatrix <- as.matrix(sampleCorrs)
+head(sampleCorrMatrix)
+
+pheatmap(sampleCorrMatrix)
+
 ``` 
 
+Instead of boiling all the gene count data for each sample down to a distance metric, you can
+get a similar sense of the pattern by just visualizing all the genes at once.
+
+```R
+
+# because there are so many genes we choose not to display their names
+pheatmap(mat=t(assay(rld)), show_colnames = FALSE)
+
+```