adding updates to DE, GSEA, cancer cell ID, and Posit Setup pages. Also adding a lpwgs cnv image for cancercell
griffithlab · Apr 26, 2024 · beb1e6f · beb1e6f
1 parent d173b12
commit beb1e6f
Showing 5 changed files with 220 additions and 46 deletions.
diff --git a/_posts/0008-04-01-DE_analysis.md b/_posts/0008-04-01-DE_analysis.md
@@ -24,7 +24,7 @@ library(Seurat)
 library(dplyr)
 library(EnhancedVolcano)
 library(presto)
-merged <- readRDS('processed_object_0409.rds')
+merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')
 ```
 
 ### Gene expression analysis for epithelial cells
@@ -83,14 +83,14 @@ FeaturePlot(merged, features = 'Epcam') +
 DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE)
 ```
 
-While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 10 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot `VlnPlot` may be more helpful. Similar to `FeaturePlot`, `VlnPlot` also takes the Seurat object and `features` as input. It also requires a `group.by` argument that determines the x-axis groupings of the cells.  
+While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 8 and 12, it is sometimes trickier to determine which cluster we are interested in solely from the UMAP, as the clusters may overlap. In this case, a violin plot (`VlnPlot`) may be more helpful. Similar to `FeaturePlot`, `VlnPlot` also takes the Seurat object and `features` as input. Here we also pass a `group.by` argument, which determines the x-axis groupings of the cells.  
 To learn more about customizing a violin plot, please refer to the [Seurat documentation](https://satijalab.org/seurat/reference/vlnplot).
 
 ```R
 VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')
 ```
 
-Thus we were able to confirm that clusters 10 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. Let's use differential expression analysis to determine how these clusters differ from each other.
+Great! Looks like we can confirm that clusters 8 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.
 
 ### Differential expression for epithelial cells
 
@@ -99,7 +99,7 @@ We can begin by restricting the Seurat object to the cells we are interested in.
 ```R
 #set ident to seurat clusters metadata column and subset object to Epcam positive clusters
 merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
-merged_epithelial <- subset(merged, idents = c('10', '12'))
+merged_epithelial <- subset(merged, idents = c('8', '12'))
 
 #confirm that we have subset the object as expected visually using a UMAP
 DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
@@ -112,20 +112,20 @@ table(merged_epithelial$seurat_clusters_res0.8)
 
 Now we will use Seurat's `FindMarkers` function to carry out a differential expression analysis between both groups. `FindMarkers` also requires that we use `SetIdent` to change the default 'Ident' to the metadata column we want to use for our comparison. More information about `FindMarkers` is available [here](https://satijalab.org/seurat/reference/findmarkers).
 
-Note that here we use `FindMarkers` to compare clusters 10 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are expressed in 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
+Note that here we use `FindMarkers` to compare clusters 8 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table listing each gene that passed the filters described below, with its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are expressed in at least 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have a log2 fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
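
To make the sign convention concrete, here is a minimal base R sketch with made-up mean expression values for two hypothetical genes. It is a simplification: Seurat's own fold-change calculation works on the normalized data and adds a pseudocount, so real `avg_log2FC` values will differ from a plain log2 ratio. The gene names and numbers are assumptions for illustration only.

```R
# Made-up mean expression per group for two hypothetical genes
mean_ident1 <- c(GeneA = 8, GeneB = 1)  # stands in for the ident.1 group
mean_ident2 <- c(GeneA = 2, GeneB = 4)  # stands in for the ident.2 group

# Simplified log2 fold change of ident.1 with respect to ident.2
log2fc <- log2(mean_ident1 / mean_ident2)
print(log2fc)

# GeneA is higher in ident.1 (positive), GeneB is lower in ident.1 (negative)
upregulated_in_ident1 <- names(log2fc)[log2fc > 0]
downregulated_in_ident1 <- names(log2fc)[log2fc < 0]

# A threshold of 0.1 on the absolute log2FC keeps both toy genes
passes_threshold <- abs(log2fc) > 0.1
```

Swapping the two groups flips the sign of every log2FC, which is why it matters which cluster is passed as `ident.1`.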
 
 ```R
 #carry out DE analysis between both groups
 merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
-epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "10", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 10 changes wrt cluster 12
+epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 8 changes wrt cluster 12
 ```
 On opening `epithelial_de` in your RStudio session, you'll see that it is a dataframe with the genes as rownames, and the following columns: `p_val`, `avg_log2FC`, `pct.1`, `pct.2`, `p_val_adj`. The p-values depend on the test used while running `FindMarkers`, and the adjusted p-value is based on the Bonferroni correction. `pct.1` and `pct.2` are the percentages of cells where the gene is detected in the `ident.1` and `ident.2` groups respectively. 
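
As a rough illustration of what a Bonferroni-adjusted p-value means, the base R sketch below adjusts three made-up p-values. The gene names, the p-values, and the total of 20,000 genes are assumptions for illustration, not values from this dataset.

```R
# Made-up raw p-values for three hypothetical genes
p_val <- c(GeneA = 1e-8, GeneB = 2e-5, GeneC = 0.01)

# Assumed number of genes in the dataset (hypothetical)
n_genes <- 20000

# Bonferroni adjustment using base R's p.adjust
p_val_adj <- p.adjust(p_val, method = "bonferroni", n = n_genes)

# The same thing by hand: multiply by the number of tests and cap at 1
p_val_adj_manual <- pmin(p_val * n_genes, 1)
print(p_val_adj)

# With a cutoff of 0.001, only GeneA stays significant
significant_genes <- names(p_val_adj)[p_val_adj < 0.001]
```

Because every p-value is scaled by the number of tests, a gene needs a very small raw p-value to survive a strict cutoff such as 0.001.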
 
-Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the 'significant DE genes only' dataframe to the top 20 genes with the highest absolute log2FC. Looking a the absolute log2FC allows us to capture both, the upregulated and downregulated genes.
+Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the 'significant DE genes only' dataframe to the top 20 genes with the highest absolute log2FC. Looking at the absolute log2FC allows us to capture both upregulated and downregulated genes.
 
 ```R
 #restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
-epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,]
+epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,] 
 
 #get the top 20 genes by fold change
 epithelial_de_sig %>%
@@ -134,7 +134,7 @@ epithelial_de_sig %>%
 
 `epithelial_de_sig_top20` is a dataframe that is restricted to the top20 most differentially expressed genes by log2FC.
 
-There are a few different ways we can visualize the differentially expressed genes. We'll start by with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the `EnhancedVolcano` package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on the parameters (`pCutoff`, `FCcutoff`) we specify.
+There are a few different ways we can visualize the differentially expressed genes. We'll start with the Violin and Feature plots from before. We can also visualize DE genes using a DotPlot, which allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these built-in Seurat functions, we can also generate a volcano plot using the `EnhancedVolcano` package. For the volcano plot, we can use the unfiltered DE results, as the function colors and labels genes based on the parameters (`pCutoff`, `FCcutoff`) we specify.
 
 ```R
 #get list of top 20 DE genes for ease
@@ -154,25 +154,26 @@ EnhancedVolcano(epithelial_de,
   lab = rownames(epithelial_de),
   x = 'avg_log2FC',
   y = 'p_val_adj',
-  title = 'Cluster10 wrt Cluster 12',
+  title = 'Cluster8 wrt Cluster 12',
   pCutoff = 0.05,
   FCcutoff = 0.5,
   pointSize = 3.0,
   labSize = 5.0,
   colAlpha = 0.3)
 ```
 
-To find out how we can figure out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. Let's create a TSV file containing our DE results for use later on. We will need to rerun `FindMarkers` with slightly different parameters for this- we will change the `logfc.threshold` parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).
+To find out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. For now, let's create a TSV file containing our DE results for use later on. We will need to rerun `FindMarkers` with slightly different parameters for this: we will change the `logfc.threshold` parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).
+
 ```R
 #rerun FindMarkers
-epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "10", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
+epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
 #save this table as a TSV file
-write.table(x = epithelial_de_gsea, file = 'epithelial_de_gsea.tsv', sep='\t')
+write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t')
 ``` 
 
 ### Differential expression for T cells
 
-For the T cell focused analysis, we will start by subsetting our `merged` object to only have T cells, by combining the various T cell annotations from celltyping section. We'll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will `SetIdent` to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we'll doublecheck that the subsetting happened as we expected it to. 
+For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from ICB-treated mice that had (some of) their T cells depleted (ICBdT). We will start by subsetting our `merged` object to only have T cells, by combining the various T cell annotations from the celltyping section. We'll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will `SetIdent` to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we'll double-check that the subsetting happened as we expected it to. 
 
 **TODO UPDATE THESE BASED ON FINAL CELLTYPING**
 
@@ -194,7 +195,7 @@ table(merged$immgen_singler_main)
 table(merged_tcells$immgen_singler_main)
 ```
 
-Now we want to compare cells from mice treated with ICB vs mice with their T cells depleted treated with ICB (ICBdT). First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand `meta.data` to get a snapshot of the columns and the what kind of data they hold. 
+Now we want to compare T cells from mice treated with ICB vs ICBdT. First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand `meta.data` to get a snapshot of the columns and what kind of data they hold. 
 
 Looks like `orig.ident` has information about the condition and replicates, but for the purposes of this DE analysis, we want a `meta.data` column that combines the replicates of each condition into a single category.
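
The idea of collapsing replicates into one condition label can be sketched on a toy data frame in base R. This uses a different technique from the tutorial code (stripping the `RepN_` prefix with a regular expression instead of matching explicit lists with `%in%`), and the data frame `toy_meta` is a hypothetical stand-in for the object's `meta.data`.

```R
# Toy metadata standing in for the Seurat object's meta.data (hypothetical)
toy_meta <- data.frame(
  orig.ident = c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB",
                 "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT"),
  stringsAsFactors = FALSE
)

# Drop the leading "Rep<number>_" so only the condition name remains
toy_meta$experimental_condition <- sub("^Rep[0-9]+_", "", toy_meta$orig.ident)

# Cross-tabulate to check each replicate maps to exactly one condition
condition_check <- table(toy_meta$orig.ident, toy_meta$experimental_condition)
print(condition_check)
```

The regular expression is shorter, but it relies on every sample name following the `RepN_condition` pattern. Listing the replicates explicitly is more verbose but fails more visibly if a name is unexpected.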
 
@@ -205,12 +206,14 @@ unique(merged_tcells$orig.ident)
 #there are 6 possible values, 3 replicates for the ICB treatment condition, and 3 for the ICBdT condition
 #so we can combine "Rep1_ICB", "Rep3_ICB", "Rep5_ICB" to ICB, and "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT" to ICBdT. 
 #first initialize a metadata column for experimental_condition
-merged@meta.data$experimental_condition <- NA
+merged_tcells@meta.data$experimental_condition <- NA
+
 #Now we can take all cells that are in each replicate-condition, and assign them to the appropriate condition
-merged@meta.data$experimental_condition[merged@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
-merged@meta.data$experimental_condition[merged@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"
+merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
+merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"
+
 #double check that the new column we generated makes sense (each replicate should correspond to its experimental condition)
-table(merged@meta.data$orig.ident, merged@meta.data$experimental_condition)
+table(merged_tcells@meta.data$orig.ident, merged_tcells@meta.data$experimental_condition)
 ```
 
 With the experimental conditions now defined, we can compare the T cells from both groups. We'll start by running `FindMarkers` with similar parameters to last time, and see how ICBdT changes with respect to ICB. Next, we'll restrict the dataframe to significant genes only, and then look at the top 5 most upregulated and downregulated DE genes by log2FC. 
@@ -221,7 +224,7 @@ merged_tcells <- SetIdent(merged_tcells, value = "experimental_condition")
 tcells_de <- FindMarkers(merged_tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25)
 
 #restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
-epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,]
+tcells_de_sig <- tcells_de[tcells_de$p_val_adj < 0.001,]
 
 #find the top 5 most downregulated genes
 tcells_de_sig %>%
@@ -236,10 +239,9 @@ The most downregulated gene in the ICBdT condition based on foldchange is Cd4. T
 Interestingly, for the list of genes that are upregulated, we see Cd8b1 show up. It could be interesting to see if the CD8 T cells' phenotype changes based on the treatment condition. So now, let's subset the object to CD8 T cells only, find DE genes to see how ICBdT CD8 T cells change compared to ICB CD8 T cells, and visualize these similar to before. 
 
 ```R
-#subset object to CD8 T cells
-cd8t_celltypes_names <- c("T cells (T.8EFF.OT1.12HR.LISOVA)","T cells (T.8EFF.OT1.24HR.LISOVA)","T cells (T.8EFF.OT1.48HR.LISOVA)","T cells (T.8EFF.OT1.D10LIS)","T cells (T.8EFF.OT1.D15.LISOVA)","T cells (T.8EFF.OT1.D45VSV)","T cells (T.8EFF.OT1.D5.VSVOVA)","T cells (T.8EFF.OT1.D8.VSVOVA)","T cells (T.8EFF.OT1.D8LISO)","T cells (T.8EFF.OT1.LISOVA)","T cells (T.8EFF.OT1.VSVOVA)","T cells (T.8EFF.OT1LISO)","T cells (T.8EFF.TBET-.OT1LISOVA)","T cells (T.8EFF.TBET+.OT1LISOVA)","T cells (T.8EFFKLRG1+CD127-.D8.LISOVA)","T cells (T.8MEM.OT1.D100.LISOVA)","T cells (T.8MEM.OT1.D106.VSVOVA)","T cells (T.8MEM.OT1.D45.LISOVA)","T cells (T.8Mem)","T cells (T.8MEM)","T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)","T cells (T.8NVE.OT1)","T cells (T.8Nve)","T cells (T.8NVE)","T cells (T.8SP24-)","T cells (T.8SP69+)","T cells (T.CD8.1H)","T cells (T.CD8.24H)","T cells (T.CD8.48H)","T cells (T.CD8.5H)","T cells (T.CD8.96H)","T cells (T.CD8.CTR)")
-merged_tcells <- SetIdent(merged_tcells, value = 'immgen_singler_fine')
-merged_cd8tcells <- subset(merged_tcells, idents = cd8t_celltypes_names)
+#subset object to CD8 T cells. Since we already showed how to subset cells using the clusters earlier, this time we'll subset to CD8 T cells by selecting for cells with high 
+#expression of Cd8 genes and low expression of Cd4 genes
+merged_cd8tcells <- subset(merged_tcells, subset= Cd8b1 > 1 & Cd8a > 1 & Cd4 < 0.1)
 
 #carry out DE analysis between both groups
 merged_cd8tcells <- SetIdent(merged_cd8tcells, value = "experimental_condition")