minor edits and updates based on celltyping

griffithlab · Apr 27, 2024 · 6c70518
1 parent 521c6e3
Showing 4 changed files with 27 additions and 53 deletions.
diff --git a/_posts/0008-04-01-DE_analysis.md b/_posts/0008-04-01-DE_analysis.md
@@ -18,7 +18,6 @@ Secondly, we will compare the T cell populations of mice treated with ICB therap
 
 Read in the saved Seurat object from the previous step if it is not already loaded in your current R session.
 
-**TODO SEE WHAT THIS NEW OBJECT IS NAMED AND WHERE IT'S KEPT FROM PREVIOUS STEPS**
 ```R
 library(Seurat)
 library(dplyr)
@@ -29,42 +28,6 @@ merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')
 
 ### Gene expression analysis for epithelial cells
 
-**TODO we may take care of the step below in a previous section**
-
-As of now the various replicates are in their own layers. They need to be merged into 1 single layer for further analysis. Also add (draft) SingleR labels to the object.
-
-```R
-library(celldex)
-library(SingleR)
-
-merged
-merged <- JoinLayers(merged)
-merged
-
-#load singler immgen reference
-ref_immgen <- celldex::ImmGenData()
-#generate predictions for our seurat object
-predictions_main = SingleR(test = GetAssayData(merged), 
-                      ref = ref_immgen,
-                      labels = ref_immgen$label.main)
-
-predictions_fine = SingleR(test = GetAssayData(merged), 
-                           ref = ref_immgen,
-                           labels = ref_immgen$label.fine)
-
-#add main labels to object
-merged[['immgen_singler_main']] = rep('NA', ncol(merged))
-merged$immgen_singler_main[rownames(predictions_main)] = predictions_main$labels
-
-#add fine labels to object
-merged[['immgen_singler_fine']] = rep('NA', ncol(merged))
-merged$immgen_singler_fine[rownames(predictions_fine)] = predictions_fine$labels
-
-
-```
-
-Compare the number of layers present before and after merging. `JoinLayers` is an important step because many of the DE functions use the log-normalized data (held in the `data` layer) instead of the data kept in the `scale.data` layer.  
-
 Use Seurat's `FeaturePlot` function to color each cell by its Epcam expression on a UMAP.
 `FeaturePlot` requires at least 2 arguments: the Seurat object and the 'feature' you want to plot (where a 'feature' can be a gene, PC scores, any of the metadata columns, etc.). To customize the `FeaturePlot`, please refer to Seurat's documentation [here](https://satijalab.org/seurat/reference/featureplot).
 
@@ -76,7 +39,6 @@ While there are some Epcam positive cells scattered on the UMAP, there appear to
 
 There are a few different ways to go about identifying what those clusters are. We can start by trying to use the `DimPlot` plotting function from before along with the `FeaturePlot` function. Separating plots by the `+` symbol allows us to plot multiple plots side-by-side.
 
-**TODO UPDATE GROUP.BY ARGUMENT BASED ON WHAT THE CELLTYPE COLUMN IS CALLED**
 ```R
 DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
 FeaturePlot(merged, features = 'Epcam') + 
@@ -90,7 +52,7 @@ To learn more about customizing a Violin plot, please refer to the [Seurat docum
 VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')
 ```
 
-Great! Looks like we can confirm that clusters 8 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.
+Great! Looks like we can confirm that clusters 9 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.
 
 ### Differential expression for epithelial cells
 
@@ -99,7 +61,7 @@ We can begin by restricting the Seurat object to the cells we are interested in.
 ```R
 #set ident to seurat clusters metadata column and subset object to Epcam positive clusters
 merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
-merged_epithelial <- subset(merged, idents = c('8', '12'))
+merged_epithelial <- subset(merged, idents = c('9', '12'))
 
 #confirm that we have subset the object as expected visually using a UMAP
 DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
@@ -112,12 +74,12 @@ table(merged_epithelial$seurat_clusters_res0.8)
 
 Now we will use Seurat's `FindMarkers` function to carry out a differential expression analysis between both groups. `FindMarkers` also requires that we use `SetIdent` to change the default 'Ident' to the metadata column we want to use for our comparison. More information about `FindMarkers` is available [here](https://satijalab.org/seurat/reference/findmarkers).
 
-Note that here we use `FindMarkers` to compare clusters 8 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are expressed in 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
+Note that here we use `FindMarkers` to compare clusters 9 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table with one row per tested gene and its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are detected in at least 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have an average log2 fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
 
 ```R
 #carry out DE analysis between both groups
 merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
-epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 8 changes wrt cluster 12
+epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "9", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 9 changes wrt cluster 12
 ```
 On opening `epithelial_de` in your RStudio session, you'll see that it is a dataframe with the genes as rownames, and the following columns: `p_val`, `avg_log2FC`, `pct.1`, `pct.2`, `p_val_adj`. The p-values depend on the test used while running `FindMarkers`, and the adjusted p-value is based on the Bonferroni correction. `pct.1` and `pct.2` are the percentages of cells where the gene is detected in the `ident.1` and `ident.2` groups respectively. 
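
As a minimal sketch of how to read such a table, the toy data frame below mimics the columns of a `FindMarkers` result. The gene names and values are made up for illustration and are not taken from this dataset. It applies a Bonferroni adjustment with base R's `p.adjust` and splits the significant genes by the sign of `avg_log2FC`. Note the assumption that the number of tests equals the number of rows in the toy table; Seurat's own adjustment may use a different count (its documentation describes using all genes in the dataset).

```R
# Toy table with the same columns as a FindMarkers result (values are made up)
toy_de <- data.frame(
  p_val      = c(1e-6, 0.002, 0.03, 0.4),
  avg_log2FC = c(2.1, -1.4, 0.3, -0.05),
  pct.1      = c(0.80, 0.10, 0.45, 0.30),
  pct.2      = c(0.20, 0.70, 0.40, 0.31),
  row.names  = c("GeneA", "GeneB", "GeneC", "GeneD")
)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
toy_de$p_val_adj <- p.adjust(toy_de$p_val, method = "bonferroni")

# Keep genes that pass the adjusted p-value cutoff
sig <- toy_de[toy_de$p_val_adj < 0.05, ]

# Positive log2FC: higher in ident.1; negative log2FC: higher in ident.2
up_in_ident1 <- rownames(sig)[sig$avg_log2FC > 0]
up_in_ident2 <- rownames(sig)[sig$avg_log2FC < 0]

print(up_in_ident1)
print(up_in_ident2)
```

The same two subsetting lines could be applied to `epithelial_de` to list genes higher in `ident.1` versus `ident.2`.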
 
@@ -154,7 +116,7 @@ EnhancedVolcano(epithelial_de,
   lab = rownames(epithelial_de),
   x = 'avg_log2FC',
   y = 'p_val_adj',
-  title = 'Cluster8 wrt Cluster 12',
+  title = 'Cluster9 wrt Cluster 12',
   pCutoff = 0.05,
   FCcutoff = 0.5,
   pointSize = 3.0,
@@ -166,7 +128,7 @@ To find out how we can figure out what these genes mean, stay tuned! The next mo
 
 ```R
 #rerun FindMarkers
-epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
+epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "9", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
 #save this table as a TSV file
 write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t')
 ``` 
@@ -175,8 +137,6 @@ write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de
 
 For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from ICB-treated mice in which (some of) the T cells were depleted (ICBdT). We will start by subsetting our `merged` object to only have T cells, by combining the various T cell annotations from the cell typing section. We'll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will use `SetIdent` to set the identity to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we'll double-check that the subsetting happened as we expected it to. 
 
-**TODO UPDATE THESE BASED ON FINAL CELLTYPING**
-
 ```R
 #check all the annotated celltypes
 unique(merged$immgen_singler_main)

diff --git a/_posts/0008-05-01-Gene_set_enrichment.md b/_posts/0008-05-01-Gene_set_enrichment.md
@@ -25,7 +25,7 @@ There are various tools available for enrichment analysis, here we chose to use
 
 We will also use a web tool [https://maayanlab.cloud/Enrichr/enrich](https://maayanlab.cloud/Enrichr/enrich) for some of our analysis.
 
-We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let's load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the `FindMarkers` function in Seurat, and had `ident.1` as `cluster 8` and `ident.2` as `cluster 12`. Therefore, we are looking at `cluster 8` with respect to `cluster 12`, that is, positive log2FC values correspond to genes upregulated in `cluster 8` or downregulated in `cluster 12` and vice versa for negative log2FC values.
+We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let's load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the `FindMarkers` function in Seurat, and had `ident.1` as `cluster 9` and `ident.2` as `cluster 12`. Therefore, we are looking at `cluster 9` with respect to `cluster 12`, that is, positive log2FC values correspond to genes upregulated in `cluster 9` or downregulated in `cluster 12` and vice versa for negative log2FC values.
 
 ```R
 #load R libraries
@@ -78,7 +78,7 @@ write.table(x = overrep_gene_list, file = 'outdir_single_cell_rna/epithelial_ove
 
 For the Enrichr webtool based analysis, we'll open that TSV file in our RStudio session, copy the genes, and paste them directly into the textbox on the right. The webtool should load multiple barplots with different enriched pathways. Feel free to click around and explore here. To compare the results against the results we generated in R, navigate to the `Cell Types` tab on the top and look for `Tabula Muris`. 
 
-An important component to a 'good' overrepresentation analysis is using one's expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful, however knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles to basal and luminal breast cancers [reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5078592/). So, the overrepresentation analysis showing genesets like 'Tabula Muris senis mammary gland basal cell ageing' and 'Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing' could suggest that the difference in unsupervised clusters 8 and 12 could be coming from the basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat's `AddModuleScore` function and determine if the clusters are split up as basal and luminal. For now we'll use the same markers defined in this dataset's original manuscript.
+An important component to a 'good' overrepresentation analysis is using one's expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful, however knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles to basal and luminal breast cancers [reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5078592/). So, the overrepresentation analysis showing genesets like 'Tabula Muris senis mammary gland basal cell ageing' and 'Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing' could suggest that the difference in unsupervised clusters 9 and 12 could be coming from the basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat's `AddModuleScore` function and determine if the clusters are split up as basal and luminal. For now we'll use the same markers defined in this dataset's original manuscript.
 
 **TODO UPDATE BASED ON SAVED RDS OBJECT**
 ```R
@@ -98,7 +98,7 @@ FeaturePlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1')
 VlnPlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'), group.by = 'seurat_clusters_res0.8', pt.size=0)
 ```
 
-Interesting! This analysis could lead us to conclude that cluster 12 is composed of basal epithelial cells, while cluster 8 is composed of luminal epithelial cells. Next, let's see if we can use GSEA to determine if there are certain biological processes that are distinct between these clusters?
+Interesting! This analysis could lead us to conclude that cluster 12 is composed of basal epithelial cells, while cluster 9 is composed of luminal epithelial cells. Next, let's see if we can use GSEA to determine whether certain biological processes are distinct between these clusters.
 
 For GSEA, we need to start by creating a named vector where the values are the log fold change values and the names are the gene names. Recall that GSEA relies on identifying any incremental gene expression changes (not just those that are statistically significant), so we will use our original unfiltered dataframe to get these values. This will be used as input to the `gseGO` function in the `clusterProfiler` library, which uses gene ontology for GSEA. The other parameters for the function include `OrgDb = org.Mm.eg.db`, the organism database from where all the pathways' genesets will be determined; `ont = "ALL"`, specifies the subontologies, with possible options being `BP (Biological Process)`, `MF (Molecular Function)`, `CC (Cellular Component)`, or `ALL`; `keyType = "SYMBOL"` tells `gseGO` that the genes in our named vector are gene symbols as opposed to Entrez IDs, or Ensembl IDs; and `pAdjustMethod="BH"` and `pvalueCutoff=0.05` specify the p-value adjustment statistical method to use and the corresponding cutoff. 
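
The named vector described above can be sketched in base R as follows. The table `toy_de_gsea` and its gene names and values are hypothetical stand-ins for the unfiltered `FindMarkers` output, not values from this dataset. The vector is sorted in decreasing order because ranked-list GSEA functions such as `gseGO` generally expect input ordered from most upregulated to most downregulated.

```R
# Toy stand-in for the unfiltered FindMarkers table (values are made up)
toy_de_gsea <- data.frame(
  avg_log2FC = c(-0.8, 1.5, 0.02, -0.3),
  row.names  = c("GeneA", "GeneB", "GeneC", "GeneD")
)

# Values are the log2 fold changes, names are the gene symbols
toy_gene_list <- toy_de_gsea$avg_log2FC
names(toy_gene_list) <- rownames(toy_de_gsea)

# Drop missing values and rank from most upregulated to most downregulated
toy_gene_list <- sort(na.omit(toy_gene_list), decreasing = TRUE)

print(toy_gene_list)
```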
 
@@ -148,7 +148,7 @@ heatplot(gse_epithelial, foldChange=gene_list)
 cnetplot(gse_epithelial, foldChange=gene_list)
 ```
 
-Based on these results, we could conclude that cluster 8 (putative luminal cells) have lower expression of quite a few pathways related to epithelial cell proliferation compared to cluster 12 (putative basal cells). 
+Based on these results, we could conclude that cluster 9 (putative luminal cells) has lower expression of genes in quite a few pathways related to epithelial cell proliferation compared to cluster 12 (putative basal cells). 
 
 
 

diff --git a/_posts/0008-06-01-Cancer_cell_identification.md b/_posts/0008-06-01-Cancer_cell_identification.md
@@ -22,7 +22,9 @@ For copy number alterations (CNAs), there are various tools that try to detect C
 ### Finding tumor cells based on Copy number data
 
 If you are working in a tumor sample where you expect to find CNAs, looking for copy number alterations in the scRNAseq data can be one way to identify tumor cells. In our case, whole genome sequencing was done on the cell line used for the mouse models, so we have some confidently determined CNAs we expect to find in the scRNAseq data- 
+
 ![CNV_LPWGS_scatterplot](/assets/module_8/CNV_scatterplot_fig2_manuscript.png)
+
 As we can see above, we expect to find gains in chromosome 2 and 11 and a loss in chromosome 12. 
 
 We will use CONICSmat to identify cells with CNAs in our scRNAseq data. While there is more information available on their [GitHub tutorial](https://github.com/diazlab/CONICS/wiki/Tutorial---CONICSmat;---Dataset:-SmartSeq2-scRNA-seq-of-Oligodendroglioma) and [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7190654/), briefly, CONICSmat fits a two-component Gaussian mixture model for each chromosomal region, and uses the Bayesian Information Criterion (BIC) to ask whether a 1-component model (all cells are the same and there's no CNA) fits better than a 2-component model (some cells have altered copy number compared to others). Here, a better fit is defined by a lower BIC score. Note that since we are only using the gene by cell counts matrix, average expression of genes in that chromosomal region is used as a proxy for a CNA. The key to most single-cell CNA based tools is that we need both tumor cells and non-tumor cells in our analysis, as a copy number gain or loss in the tumor cells can only be measured relative to healthy cells. While CONICSmat can be run on all cells in the sample together, for our purposes, we will subset the object to Epithelial cells and B cells so that it can run more efficiently. CONICSmat is a somewhat computationally intensive tool, so we will start by clearing our workspace using the broom icon on the top right pane, and also click on the drop-down menu with a piechart beside it and select `Free unused memory`.
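
To make the 1-component versus 2-component comparison concrete, here is a minimal base R sketch on simulated data. It is not CONICSmat's implementation: the values in `region_expr` are simulated, and the hand-written EM loop is only an illustrative stand-in for a proper mixture-model fit. It uses the convention BIC = -2 * log-likelihood + (number of parameters) * log(n), so a lower score indicates a better fit.

```R
set.seed(1)
# Simulated per-cell average expression for one chromosomal region:
# 150 "normal" cells and 50 cells with a gain (shifted upwards)
region_expr <- c(rnorm(150, mean = 0, sd = 0.3), rnorm(50, mean = 1.5, sd = 0.3))
n <- length(region_expr)

# BIC = -2 * log-likelihood + (number of parameters) * log(n); lower is better
bic <- function(loglik, n_params) -2 * loglik + n_params * log(n)

# 1-component model: a single Gaussian (mean and sd, 2 parameters)
mu1 <- mean(region_expr)
sd1 <- sqrt(mean((region_expr - mu1)^2))
loglik1 <- sum(dnorm(region_expr, mu1, sd1, log = TRUE))

# 2-component model fitted with a basic EM loop (2 means, 2 sds, 1 weight)
mu <- as.numeric(quantile(region_expr, c(0.25, 0.75)))
sigma <- rep(sd(region_expr), 2)
w <- c(0.5, 0.5)
for (iter in 1:200) {
  # E-step: posterior probability that each cell belongs to component 2
  d1 <- w[1] * dnorm(region_expr, mu[1], sigma[1])
  d2 <- w[2] * dnorm(region_expr, mu[2], sigma[2])
  resp2 <- d2 / (d1 + d2)
  # M-step: update weights, means and standard deviations
  w <- c(mean(1 - resp2), mean(resp2))
  mu <- c(weighted.mean(region_expr, 1 - resp2),
          weighted.mean(region_expr, resp2))
  sigma <- c(sqrt(weighted.mean((region_expr - mu[1])^2, 1 - resp2)),
             sqrt(weighted.mean((region_expr - mu[2])^2, resp2)))
}
loglik2 <- sum(log(w[1] * dnorm(region_expr, mu[1], sigma[1]) +
                   w[2] * dnorm(region_expr, mu[2], sigma[2])))

bic_one <- bic(loglik1, 2)
bic_two <- bic(loglik2, 5)
print(c(one_component = bic_one, two_component = bic_two))
```

Because the simulated region contains two well-separated groups of cells, the 2-component model is expected to give the lower BIC here; for a region without a CNA, the extra parameters would be penalized and the 1-component model would tend to score better.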
@@ -40,7 +42,7 @@ library("hdf5r")
 library("CONICSmat")
 library("showtext")
 
-merged <- readRDS('merged_processed_object.rds')
+merged <- readRDS('preprocessed_object.rds')
 
 #subset Seurat object to only include epithelial cells and B cells
 merged <- SetIdent(merged, value = 'immgen_singler_main')