diff --git a/_posts/0008-04-01-DE_analysis.md b/_posts/0008-04-01-DE_analysis.md
index 66511c2..bd55906 100644
--- a/_posts/0008-04-01-DE_analysis.md
+++ b/_posts/0008-04-01-DE_analysis.md
@@ -24,7 +24,7 @@ library(Seurat)
 library(dplyr)
 library(EnhancedVolcano)
 library(presto)
-merged <- readRDS('processed_object_0409.rds')
+merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')
 ```
 
 ### Gene expression analysis for epithelial cells
@@ -83,14 +83,14 @@ FeaturePlot(merged, features = 'Epcam') +
 DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE)
 ```
 
-While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 10 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot `VlnPlot` may be more helpful. Similar to `FeaturePlot`, `VlnPlot` also takes the Seurat object and `features` as input. It also requires a `group.by` argument that determines the x-axis groupings of the cells.  
+While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 8 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot `VlnPlot` may be more helpful. Similar to `FeaturePlot`, `VlnPlot` also takes the Seurat object and `features` as input. It also requires a `group.by` argument that determines the x-axis groupings of the cells.  
 To learn more about customizing a Violin plot, please refer to the [Seurat documentation](https://satijalab.org/seurat/reference/vlnplot)
 
 ```R
 VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')
 ```
 
-Thus we were able to confirm that clusters 10 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. Let's use differential expression analysis to determine how these clusters differ from each other.
+Great! Looks like we can confirm that clusters 8 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.
 
 ### Differential expression for epithelial cells
 
@@ -99,7 +99,7 @@ We can begin by restricting the Seurat object to the cells we are interested in.
 ```R
 #set ident to seurat clusters metadata column and subset object to Epcam positive clusters
 merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
-merged_epithelial <- subset(merged, idents = c('10', '12'))
+merged_epithelial <- subset(merged, idents = c('8', '12'))
 
 #confirm that we have subset the object as expected visually using a UMAP
 DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
@@ -112,20 +112,20 @@ table(merged_epithelial$seurat_clusters_res0.8)
 
 Now we will use Seurat's `FindMarkers` function to carry out a differential expression analysis between both groups. `FindMarkers` also requires that we use `SetIdent` to change the default 'Ident' to the metadata column we want to use for our comparison. More information about `FindMarkers` is available [here](https://satijalab.org/seurat/reference/findmarkers).
 
-Note that here we use `FindMarkers` to compare clusters 10 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are expressed in 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
+Note that here we use `FindMarkers` to compare clusters 8 and 12. The default syntax of `FindMarkers` requires that we provide each group of cells as `ident.1` and `ident.2`. The output of `FindMarkers` is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of `ident.1` with respect to `ident.2`. Therefore, genes upregulated in `ident.1` have positive log2FC, while those downregulated in `ident.1` have negative log2FC. Here, we also provide a `min.pct=0.25` argument so that we only test genes that are expressed in at least 25% of cells in either of the `ident.1` or `ident.2` groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the `logfc.threshold=0.1` parameter, which ensures our results only include genes that have an average log2 fold change of less than -0.1 or more than 0.1. Increasing the `min.pct` and `logfc.threshold` parameters can also result in the function running faster as they reduce the number of genes being tested.
 
 ```R
 #carry out DE analysis between both groups
 merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
-epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "10", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 10 changes wrt cluster 12
+epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 8 changes wrt cluster 12
 ```
 On opening `epithelial_de` in your RStudio session, you'll see that it is a dataframe with the genes as rownames, and the following columns- `p_val`, `avg_log2FC`, `pct.1`, `pct.2`, `p_val_adj`. The p-values are dependent on the test used while running `FindMarkers`, and the adjusted p-value is based on the bonferroni correction test. `pct.1` and `pct.2` are the percentages of cells where the gene is detected in the `ident.1` and `ident.2` groups respectively. 
 
-Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the 'significant DE genes only' dataframe to the top 20 genes with the highest absolute log2FC. Looking a the absolute log2FC allows us to capture both, the upregulated and downregulated genes.
+Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the 'significant DE genes only' dataframe to the top 20 genes with the highest absolute log2FC. Looking at the absolute log2FC allows us to capture both upregulated and downregulated genes.
 
 ```R
 #restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
-epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,]
+epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,] 
 
 #get the top 20 genes by fold change
 epithelial_de_sig %>%
@@ -134,7 +134,7 @@ epithelial_de_sig %>%
 
 `epithelial_de_sig_top20` is a dataframe that is restricted to the top20 most differentially expressed genes by log2FC.
 
-There are a few different ways we can visualize the differentially expressed genes. We'll start by with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the `EnhancedVolcano` package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on the parameters (`pCutoff`, `FCcutoff`) we specify.
+There are a few different ways we can visualize the differentially expressed genes. We'll start with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the `EnhancedVolcano` package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on parameters (`pCutoff`, `FCcutoff`) we specify.
 
 ```R
 #get list of top 20 DE genes for ease
@@ -154,7 +154,7 @@ EnhancedVolcano(epithelial_de,
   lab = rownames(epithelial_de),
   x = 'avg_log2FC',
   y = 'p_val_adj',
-  title = 'Cluster10 wrt Cluster 12',
+  title = 'Cluster 8 wrt Cluster 12',
   pCutoff = 0.05,
   FCcutoff = 0.5,
   pointSize = 3.0,
@@ -162,17 +162,18 @@ EnhancedVolcano(epithelial_de,
   colAlpha = 0.3)
 ```
 
-To find out how we can figure out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. Let's create a TSV file containing our DE results for use later on. We will need to rerun `FindMarkers` with slightly different parameters for this- we will change the `logfc.threshold` parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).
+To find out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. For now, let's create a TSV file containing our DE results for use later on. We will need to rerun `FindMarkers` with slightly different parameters for this- we will change the `logfc.threshold` parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).
+
 ```R
 #rerun FindMarkers
-epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "10", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
+epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "8", ident.2 = "12", min.pct=0.25, logfc.threshold=0)
 #save this table as a TSV file
-write.table(x = epithelial_de_gsea, file = 'epithelial_de_gsea.tsv', sep='\t')
+write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t')
 ``` 
 
 ### Differential expression for T cells
 
-For the T cell focused analysis, we will start by subsetting our `merged` object to only have T cells, by combining the various T cell annotations from celltyping section. We'll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will `SetIdent` to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we'll doublecheck that the subsetting happened as we expected it to. 
+For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from mice that were treated with ICB after (some of) their T cells were depleted (ICBdT). We will start by subsetting our `merged` object to only have T cells, by combining the various T cell annotations from the celltyping section. We'll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will `SetIdent` to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we'll double-check that the subsetting happened as we expected it to. 
 
 **TODO UPDATE THESE BASED ON FINAL CELLTYPING**
 
@@ -194,7 +195,7 @@ table(merged$immgen_singler_main)
 table(merged_tcells$immgen_singler_main)
 ```
 
-Now we want to compare cells from mice treated with ICB vs mice with their T cells depleted treated with ICB (ICBdT). First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand `meta.data` to get a snapshot of the columns and the what kind of data they hold. 
+Now we want to compare T cells from mice treated with ICB vs ICBdT. First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand `meta.data` to get a snapshot of the columns and what kind of data they hold. 
 
 Looks like `orig.ident` has information about the condition and replicates, but for the purposes of this DE analysis, we want a `meta.data` column that combines the replicates of each condition. So, we want to combine the replicates of each condition together into a single category.
 
@@ -205,12 +206,14 @@ unique(merged_tcells$orig.ident)
 #there are 6 possible values, 3 replicates for the ICB treatment condition, and 3 for the ICBdT condition
 #so we can combine "Rep1_ICB", "Rep3_ICB", "Rep5_ICB" to ICB, and "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT" to ICBdT. 
 #first initialize a metadata column for experimental_condition
-merged@meta.data$experimental_condition <- NA
+merged_tcells@meta.data$experimental_condition <- NA
+
 #Now we can take all cells that are in each replicate-condition, and assign them to the appropriate condition
-merged@meta.data$experimental_condition[merged@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
-merged@meta.data$experimental_condition[merged@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"
+merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
+merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"
+
 #double check that the new column we generated makes sense (each replicate should correspond to its experimental condition)
-table(merged@meta.data$orig.ident, merged@meta.data$experimental_condition)
+table(merged_tcells@meta.data$orig.ident, merged_tcells@meta.data$experimental_condition)
 ```
 
 With the experimental conditions now defined, we can compare the T cells from both groups. We'll start by using `FindMarkers` using similar parameters to last time, and see how ICBdT changes with respect to ICB. Next, restrict the dataframe to significant genes only, and then look at the top 5 most upregulated and downregulated DE genes by log2FC. 
@@ -221,7 +224,7 @@ merged_tcells <- SetIdent(merged_tcells, value = "experimental_condition")
 tcells_de <- FindMarkers(merged_tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25)
 
 #restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
-epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,]
+tcells_de_sig <- tcells_de[tcells_de$p_val_adj < 0.001,]
 
 #find the top 5 most downregulated genes
 tcells_de_sig %>%
@@ -236,10 +239,9 @@ The most downregulated gene in the ICBdT condition based on foldchange is Cd4. T
 Interestingly, for the list of genes that are upregulated, we see Cd8b1 show up. It could be interesting to see if the CD8 T cells' phenotype changes based on the treatment condition. So now, let's subset the object to CD8 T cells only, find DE genes to see how ICBdT CD8 T cells change compared to ICB CD8 T cells, and visualize these similar to before. 
 
 ```R
-#subset object to CD8 T cells
-cd8t_celltypes_names <- c("T cells (T.8EFF.OT1.12HR.LISOVA)","T cells (T.8EFF.OT1.24HR.LISOVA)","T cells (T.8EFF.OT1.48HR.LISOVA)","T cells (T.8EFF.OT1.D10LIS)","T cells (T.8EFF.OT1.D15.LISOVA)","T cells (T.8EFF.OT1.D45VSV)","T cells (T.8EFF.OT1.D5.VSVOVA)","T cells (T.8EFF.OT1.D8.VSVOVA)","T cells (T.8EFF.OT1.D8LISO)","T cells (T.8EFF.OT1.LISOVA)","T cells (T.8EFF.OT1.VSVOVA)","T cells (T.8EFF.OT1LISO)","T cells (T.8EFF.TBET-.OT1LISOVA)","T cells (T.8EFF.TBET+.OT1LISOVA)","T cells (T.8EFFKLRG1+CD127-.D8.LISOVA)","T cells (T.8MEM.OT1.D100.LISOVA)","T cells (T.8MEM.OT1.D106.VSVOVA)","T cells (T.8MEM.OT1.D45.LISOVA)","T cells (T.8Mem)","T cells (T.8MEM)","T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)","T cells (T.8NVE.OT1)","T cells (T.8Nve)","T cells (T.8NVE)","T cells (T.8SP24-)","T cells (T.8SP69+)","T cells (T.CD8.1H)","T cells (T.CD8.24H)","T cells (T.CD8.48H)","T cells (T.CD8.5H)","T cells (T.CD8.96H)","T cells (T.CD8.CTR)")
-merged_tcells <- SetIdent(merged_tcells, value = 'immgen_singler_fine')
-merged_cd8tcells <- subset(merged_tcells, idents = cd8t_celltypes_names)
+#subset object to CD8 T cells. Since we already showed how to subset cells using the clusters earlier, this time we'll subset to CD8 T cells by selecting for cells with high 
+#expression of Cd8 genes and low expression of Cd4 genes
+merged_cd8tcells <- subset(merged_tcells, subset= Cd8b1 > 1 & Cd8a > 1 & Cd4 < 0.1)
 
 #carry out DE analysis between both groups
 merged_cd8tcells <- SetIdent(merged_cd8tcells, value = "experimental_condition")
diff --git a/_posts/0008-05-01-Gene_set_enrichment.md b/_posts/0008-05-01-Gene_set_enrichment.md
index b90d06d..b247111 100644
--- a/_posts/0008-05-01-Gene_set_enrichment.md
+++ b/_posts/0008-05-01-Gene_set_enrichment.md
@@ -11,13 +11,13 @@ date: 0008-05-01
 
 ***
 
-After carrying out differential expression analysis, and getting a list of interesting genes, a common next step is enrichment or pathway analyses. Broadly enrichment analyses can be divided into two types- overrepresentation analysis and  gene set enrichment analysis (GSEA). 
+After carrying out differential expression analysis, and getting a list of interesting genes, a common next step is enrichment or pathway analyses. Broadly, enrichment analyses can be divided into two types- overrepresentation analysis and gene set enrichment analysis (GSEA). 
 
-Overrepresentation analysis takes a list of significantly DE genes and determines if these genes are all known to be differentially regulated in a certain pathway or geneset. It is primarily useful if we have a set of genes that are highly differentially expressed and we want to determine what process(es) they may be involved in. Mathematically, it calculates a p-value using the hypergeometric distribution to determine if a gene set (from a database) is significantly over-represented in our DE genes. A couple key points about overrepresentation analysis are that firstly, we get to determine the list of genes that are used as inputs. So, we can set a p-value and log2FC threshold that would in turn determine the gene list. Secondly, since the overrepresentation analysis does not use information about the foldchange values (only a list of genes) it is not directional. So if an overrepresentation analysis gives us a pathway or geneset as being significantly enriched, we are not getting any information about whether the genes in our list are responsible for activating or suppressing the pathway- we can only conclude that our genes are involved in that pathway in some way.
+Overrepresentation analysis takes a list of significantly DE genes and determines whether genes belonging to a particular pathway or geneset appear in that list more often than we would expect by chance. It is primarily useful if we have a set of genes that are highly differentially expressed and we want to determine what process(es) they may be involved in. Mathematically, it calculates a p-value using a hypergeometric distribution to determine if a gene set (from a database) is significantly over-represented in our DE genes. There are a couple of key points about overrepresentation analysis. Firstly, we get to determine the list of genes that are used as inputs, so we can set a p-value and log2FC threshold that would in turn determine the gene list. Secondly, since the overrepresentation analysis does not use information about the foldchange values (only a list of genes) it is not directional. So if an overrepresentation analysis gives us a pathway or geneset as being significantly enriched, we are not getting any information about whether the genes in our list are responsible for activating or suppressing the pathway- we can only conclude that our genes are involved in that pathway in some way.
 
-GSEA addresses the second point above because it uses a list of genes and their corresponding fold change values as inputs to the analysis. It is also different from an overrepresentation analysis because in this case we will use all the genes as inputs without applying any filters based on log2FC or p-values. GSEA is useful in determinining incremental changes at the gene expression level that may come together to have an impact on a specific pathway. GSEA ranks genes based on their 'enrichment scores' (ES), which measures the degree to which a set of genes is over-represented at the top or bottom of a list of genes ranked or ordered based on their log2FC values.
+GSEA addresses the second point above because it uses a list of genes and their corresponding fold change values as inputs to the analysis. Another difference between GSEA and overrepresentation analysis is that in GSEA, we will use all the genes as inputs without applying any filters based on log2FC or p-values. GSEA is useful in determining incremental changes at the gene expression level that may come together to have an impact on a specific pathway. GSEA calculates an 'enrichment score' (ES) for each gene set, which measures the degree to which the genes in that set are over-represented at the top or bottom of a list of genes that are ordered based on their log2FC values.
 
-Another crucial part of any enrichment analysis is the databases. The main pitfall to avoid is choosing multiple or broad databases as this can result in many spurious results. Therefore, the reference databases should be chosen based on their biological relevance.
+Another crucial part of any enrichment analysis is the databases. The main pitfall to avoid is choosing multiple or broad databases as this can result in many spurious results. Therefore, when possible, it is better to choose the reference databases based on their biological relevance.
 
 ***
 
@@ -25,7 +25,7 @@ There are various tools available for enrichment analysis, here we chose to use
 
 We will also use a web tool [https://maayanlab.cloud/Enrichr/enrich](https://maayanlab.cloud/Enrichr/enrich) for some of our analysis.
 
-We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let's load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the `FindMarkers` function in Seurat, and had `ident.1` as `cluster 10` and `ident.2` as `cluster 12`, therefore, we are looking at `cluster 10` with respect to `cluster 12`, that is, positive log2FC values correspond to genes upregulated in `cluster 10` and downregulated in `cluster 12` and vice versa for negative log2FC values.
+We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let's load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the `FindMarkers` function in Seurat, and had `ident.1` as `cluster 8` and `ident.2` as `cluster 12`. Therefore, we are looking at `cluster 8` with respect to `cluster 12`, that is, positive log2FC values correspond to genes upregulated in `cluster 8` or downregulated in `cluster 12` and vice versa for negative log2FC values.
 
 ```R
 #load R libraries
@@ -41,14 +41,16 @@ library("stringr")
 library("enrichplot")
 
 #read in the epithelial DE file
-de_gsea_df <- read.csv('epithelial_de_gsea.tsv', sep = '\t')
+de_gsea_df <- read.csv('outdir_single_cell_rna/epithelial_de_gsea.tsv', sep = '\t')
 
 head(de_gsea_df)
 #open this file in Rstudio and get a sense for the distribution of foldchange values and see if their p values are significant
-#alternatively try making a histogram of log2FC values using ggplot and the geom_histogram() function. You can also go one step further and impose a p-value cutoff (say 0.01) and plot the distribution. 
+#alternatively try making a histogram of log2FC values using ggplot and the geom_histogram() function. 
+ggplot(de_gsea_df, aes(avg_log2FC)) + geom_histogram()
+#You can also go one step further and impose a p-value cutoff (say 0.01) and plot the distribution. 
 ```
 
-You may notice that we have quite a few genes with fairly large fold change values- while this does not impact the GSEA analysis, this can inform the thresholds we use for the overrepresentation analysis. Since we know that we have quite a few genes with foldchanges greater/lower than +/- 2, we can use that as our cutoff. We will also impose an adj p-value cutoff of 0.01. Thus, for the overrepresentation analysis, we will begin by filtering `de_gsea_df` based on the log2FC and p-value, and then get the list of genes for our analysis.
+You may notice that we have quite a few genes with fairly large fold change values- while fold change values do not impact the overrepresentation analysis itself, they can inform the thresholds we use for picking the genes. Since we know that we have quite a few genes with log2FC values above 2 or below -2, we can use an absolute log2FC of 2 as our cutoff. We will also impose an adjusted p-value cutoff of 0.01. Thus, for the overrepresentation analysis, we will begin by filtering `de_gsea_df` based on the log2FC and adjusted p-value, and then get the list of genes for our analysis.
 
 ```R
 #filter de_gsea_df by subsetting it to only include genes that are significantly DE (pval<0.01) and their absolute log2FC is > 2. The abs(de_gsea_df$avg_log2FC) ensures that we keep both the up and downregulated genes
@@ -56,11 +58,11 @@ overrep_df <- de_gsea_df[de_gsea_df$p_val_adj < 0.01 & abs(de_gsea_df$avg_log2FC
 overrep_gene_list <- rownames(overrep_df)
 ```
 
-Next, we will set up our reference. By default `clusterProfiler` allows us to use the [msigdb reference](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). While that works, here we will show how you can download a mouse specific celltype signature reference geneset from msigdb and use that for your analysis. We will use the M8 geneset from the [msigdb mouse collections](https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp?targetSpeciesDB=Mouse#M8). We clicked on the `Gene Symbols` link on the right to download the dataset and uploaded that to your workspace. These files are in a `gmt` (gene matrix transposed) format, and can be read-in using an in-built R function, `read.gmt`. And once we have the reference data loaded, we will use the `enricher` function in the `clusterProfiler` library for the overrepresentation analysis. The inputs to the function include the DE gene list, the reference database, the statistical method for p-value adjustment, and finally a pvalue cutoff threshold. This generates an overrepresentation R object that can be input into visualization functions like `barplot()` and `dotplot()` to make some typical pathway analysis figures. We can also the webtool, https://maayanlab.cloud/Enrichr/, for a quick analysis against multiple databases. For this part, let's save the genelist we're using for the overrepresentation analysis to a TSV file.
+Next, we will set up our reference. By default `clusterProfiler` allows us to use the [msigdb reference](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). While that works, here we will show how you can download a mouse specific celltype signature reference geneset from msigdb and use that for your analysis. We will use the M8 geneset from the [msigdb mouse collections](https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp?targetSpeciesDB=Mouse#M8). We clicked on the `Gene Symbols` link on the right to download the dataset and uploaded that to your workspace. These files are in a `gmt` (gene matrix transposed) format, and can be read in using the `read.gmt` function from the `clusterProfiler` library. Once we have the reference data loaded, we will use the `enricher` function in the `clusterProfiler` library for the overrepresentation analysis. The inputs to the function include the DE gene list, the reference database, the statistical method for p-value adjustment, and finally a p-value cutoff threshold. This generates an overrepresentation R object that can be input into visualization functions like `barplot()` and `dotplot()` to make some typical pathway analysis figures. We can also use the webtool, https://maayanlab.cloud/Enrichr/, for a quick analysis against multiple databases. For this part, we will save the genelist we're using for the overrepresentation analysis to a TSV file.
 
 ```R
 #read in the tabula muris gmt file
-msigdb_m8 <- read.gmt('m8.all.v2023.2.Mm.symbols.gmt')
+msigdb_m8 <- read.gmt('/cloud/project/data/single_cell_rna/reference_files/m8.all.v2023.2.Mm.symbols.gmt')
 #click on the dataframe in RStudio to see how it's formatted- we have 2 columns, the first with the genesets, and the other with genes that are in that geneset.
 #try to determine how many different pathways are in this database
 overrep_msigdb_m8 <- enricher(gene = overrep_gene_list, TERM2GENE = msigdb_m8, pAdjustMethod = "BH", pvalueCutoff = 0.05)
@@ -71,20 +73,21 @@ dotplot(overrep_msigdb_m8, showCategory = 10)
 
 #save overrep_gene_list to a tsv file (overrep_gene_list is our list of genes and file is the name we want the file to have when it's saved. 
 #The remaining arguments are optional- row.names=FALSE stops R from adding numbers (effectively an S.No column), col.names gives our single column TSV a column name, and quote=FALSE ensures the genes don't have quotes around them which is the default way R saves string values to a TSV)
-write.table(x = overrep_gene_list, file = 'epithelial_overrep_gene_list.tsv', row.names = FALSE, col.names = 'overrep_genes', quote=FALSE)
+write.table(x = overrep_gene_list, file = 'outdir_single_cell_rna/epithelial_overrep_gene_list.tsv', row.names = FALSE, col.names = 'overrep_genes', quote=FALSE)
 ```
 
 For the Enrichr webtool based analysis, we'll open that TSV file in our Rstudio session, copy the genes, and paste them directly into the textbox on the right. The webtool should load multiple barplots with different enriched pathways. Feel free to click around and explore here. To compare the results against the results we generated in R, navigate to the `Cell Types` tab on the top and look for `Tabula Muris`. 
 
-An important component to a 'good' overrepresentation analysis is using one's expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful, however knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles to basal and luminal breast cancers [reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5078592/). So, the overrepresentation analysis showing genesets like 'Tabula Muris senis mammary gland basal cell ageing' and 'Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing' could suggest that the difference in the unsupervised clusters 10 and 12 could be coming from the basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat's `AddModuleScore` function and determine if the clusters are split up as basal and luminal. For now we'll use the same markers defined in this dataset's manuscript.
+An important component to a 'good' overrepresentation analysis is using one's expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful. However, knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles to basal and luminal breast cancers [reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5078592/). So, the overrepresentation analysis showing genesets like 'Tabula Muris senis mammary gland basal cell ageing' and 'Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing' suggests that the difference between unsupervised clusters 8 and 12 could be coming from the basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat's `AddModuleScore` function and determine if the clusters are split up as basal and luminal. For now we'll use the same markers defined in this dataset's original manuscript.
 
+**TODO UPDATE BASED ON SAVED RDS OBJECT**
 ```R
 #define lists of marker genes
 basal_markers <- c('Cd44', 'Krt14', 'Krt5', 'Krt16', 'Krt6a')
 luminal_markers <- c('Cd24a', 'Erbb2', 'Erbb3', 'Foxa1', 'Gata3', 'Gpx2', 'Krt18', 'Krt19', 'Krt7', 'Krt8', 'Upk1a')
 
 #read in the seurat object if it isn't loaded in your R session
-merged <- readRDS('processed_object_0409.rds')
+merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')
 
 #use AddModuleScore to calculate a single score that summarizes the gene expression for each list of markers
 merged <- AddModuleScore(merged, features=list(basal_markers), name='basal_markers_score')
@@ -95,7 +98,7 @@ FeaturePlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1')
 VlnPlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'), group.by = 'seurat_clusters_res0.8', pt.size=0)
 ```
 
-Interesting! This analysis could lead us to conclude that cluster 12 composed of basal epithelial cells, while cluster 10 is composed of luminal epithelial cells. Next, can we use GSEA to determine if there are certain biological processes that are distinct between these clusters?
+Interesting! This analysis could lead us to conclude that cluster 12 is composed of basal epithelial cells, while cluster 8 is composed of luminal epithelial cells. Next, let's see if we can use GSEA to determine whether certain biological processes are distinct between these clusters.
 
 For GSEA, we need to start by creating a named vector where the values are the log fold change values and the names are the gene's names. Recall that GSEA analysis relies on identifying any incremental gene expression changes (not just those that are statistically significant), so we will use our original unfiltered dataframe to get these values. This will be used as input to the `gseGO` function in the `clusterProfiler` library, which uses gene ontology for GSEA analysis. The other parameters for the function include `OrgDb = org.Mm.eg.db`, the organism database from where all the pathways' genesets will be determined; `ont = "ALL"`, specifies the subontologies, with possible options being `BP (Biological Process)`, `MF (Molecular Function)`, `CC (Cellular Compartment)`, or `ALL`; `keyType = "SYMBOL"` tells `gseGO` that the genes in our named vector are gene symbols as opposed to Entrez IDs, or Ensembl IDs; and `pAdjustMethod="BH"` and `pvalueCutoff=0.05` specify the p-value adjustment statistical method to use and the corresponding cutoff. 
 
@@ -136,16 +139,16 @@ gse_epithelial@result <- gse_epithelial@result[subset_indices,]
 
 #plot!
 #dotplot - splitting by 'sign' and facet_grid together allow us to separate activated and suppressed pathways
-dotplot(gse_copy, showCategory=20, split=".sign") + facet_grid(.~.sign) 
+dotplot(gse_epithelial, showCategory=20, split=".sign") + facet_grid(.~.sign) 
 
 #heatplot - allows us to see the genes that are being considered for each of the pathways/genesets and their corresponding fold change
-heatplot(gse_copy, foldChange=gene_list)
+heatplot(gse_epithelial, foldChange=gene_list)
 
 #cnetplot - allows us to see the genes along with the various pathways/genesets and how they related to each other
-cnetplot(gse_copy, foldChange=gene_list)
+cnetplot(gse_epithelial, foldChange=gene_list)
 ```
 
-Based on these results, we could conclude that cluster 10 (putative luminal cells) are downregulating quite a few pathways related to epithelial pathways compared to cluster 12 (putative basal cells); or conversely cluster 12 is upregulating epithelial pathways compared to cluster 10. 
+Based on these results, we could conclude that cluster 8 (putative luminal cells) has lower expression of quite a few pathways related to epithelial cell proliferation compared to cluster 12 (putative basal cells).
 
 
 
diff --git a/_posts/0008-06-01-Cancer_cell_identification.md b/_posts/0008-06-01-Cancer_cell_identification.md
index 8845fcd..054fc22 100644
--- a/_posts/0008-06-01-Cancer_cell_identification.md
+++ b/_posts/0008-06-01-Cancer_cell_identification.md
@@ -9,5 +9,162 @@ feature_image: "assets/genvis-dna-bg_optimized_v1a.png"
 date: 0008-06-01
 ---
 
-## Cancer cell identification
+***
+
+Often when we're analyzing cancer samples using scRNAseq, we need to distinguish tumor cells from healthy cells. Usually, we know what kind of celltype we expect the tumor cells to be (epithelial cells for bladder cancer, B cells for a B cell lymphoma, etc.), but we may want to separate tumor cells from normal cells of the same celltype for various reasons such as DE analyses. Also, since we are drawing conclusions related to gene expression from these comparisons, we may want to identify tumor cells using methods that are orthogonal to the genes expressed by the tumor cells. To that end, two common approaches are to identify either mutations or copy number alterations in the scRNAseq data.
+
+In the case of mutations, one will typically carry out mutation calling from DNA sequencing data (ideally from tumor-normal paired samples) and then 'look' for those mutations in the scRNAseq data's BAM files using a tool like `VarTrix`. Because of the sparsity of single cell data, and its end bias (5' or 3' depending on the kit), it is unlikely that we can find every tumor cell with this approach; however, we can be fairly confident in the tumor cells we do identify.
+
+For copy number alterations (CNAs), various tools such as [CONICSmat](https://github.com/diazlab/CONICS/tree/master) and [InferCNV](https://github.com/broadinstitute/infercnv) try to detect CNAs in tumor cells. They rely on a similar principle: using the counts matrix to identify regions of the genome that collectively have higher or lower expression. If (say) 100+ genes in the same region have higher (or lower) expression, that is more likely due to a copy number gain (or loss) than to each gene being individually upregulated (or downregulated).
+
+***
+
+### Finding tumor cells based on copy number data
+
+If you are working with a tumor sample where you expect to find CNAs, looking for copy number alterations in the scRNAseq data can be one way to identify tumor cells. In our case, whole genome sequencing was done on the cell line used for the mouse models, so we have some confidently determined CNAs we expect to find in the scRNAseq data:
+![CNV_LPWGS_scatterplot](/assets/module_8/CNV_scatterplot_fig2_manuscript.png)
+As we can see above, we expect to find gains in chromosome 2 and 11 and a loss in chromosome 12. 
+
+We will use CONICSmat to identify cells with CNAs in our scRNAseq data. While there is more information available on their [GitHub tutorial](https://github.com/diazlab/CONICS/wiki/Tutorial---CONICSmat;---Dataset:-SmartSeq2-scRNA-seq-of-Oligodendroglioma) and [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7190654/), briefly CONICSmat fits a two-component Gaussian mixture model for each chromosomal region, and uses a Bayesian Information Criterion (BIC) statistical test to ask if a 1-component model (all cells are the same and there's no CNA) fits better than a 2-component model (some cells have altered copy number compared to others). Here, a better fit is defined by a lower BIC score. Note that since we are only using the gene by cell counts matrix, average expression of genes in that chromosomal region is used as a proxy for a CNA. The key to most single-cell CNA based tools is that we need both tumor cells and non-tumor cells in our analysis as a copy number gain or loss in the tumor cells can only be measured relative to healthy cells. While CONICSmat can be run on all cells in the sample together, for our purposes, we will subset the object to Epithelial cells and B cells so that it can run more efficiently. CONICSmat is a somewhat computationally intensive tool, so we will start by clearing our workspace using the broom icon on the top right pane, and also click on the drop-down menu with a piechart beside it and select `Free unused memory`.
+
+```R
+#make sure you have cleared your workspace and freed unused memory, otherwise you may run into issues later on!
+
+#start by loading in the libraries and the preprocessed Seurat object
+library("Seurat") 
+library("ggplot2")
+library("cowplot")
+library("dplyr") 
+library("Matrix")
+library("hdf5r")
+library("CONICSmat")
+library("showtext")
+
+merged <- readRDS('merged_processed_object.rds')
+
+#subset Seurat object to only include epithelial cells and B cells
+merged <- SetIdent(merged, value = 'immgen_singler_main')
+merged_subset <- subset(merged, idents=c('B cells', 'Epithelial cells'))
+
+#as always double check that the number of cells match our expectations
+table(merged$immgen_singler_main)
+table(merged_subset$immgen_singler_main)
+```
+
+Now we can look into running CONICSmat! It is run in a few steps and requires us to look at some outputs along the way and define some thresholds. The main inputs CONICSmat needs are a Seurat counts matrix, a `regions` file specifying genomic regions, and a `gene_pos` file that has the genomic coordinates of all the genes. After we run the `plotAll` function, we will get a PDF file with results from the BIC test for every chromosomal region. We will look through this file to determine an appropriate BIC difference threshold that will help capture true CNA events. The BIC difference is (BIC 1 component score) - (BIC 2 component score), and a greater difference suggests a higher chance of a 'real' CNA event. Once we have a confident set of CNAs, we will use that information to cluster the cells in a heatmap, where the heatmap is colored by z-scores calculated from the normalized average expression values.
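+
+To make the BIC difference concrete, here is a minimal sketch on simulated data (not the workshop dataset). It compares a 1-component Gaussian with a 2-component mixture fitted by `mixtools::normalmixEM`. The vector `region_expr`, the simulated means, and the parameter counts used in the BIC formula are illustrative assumptions, and CONICSmat's internal calculation may differ in its details.
+
+```R
+#sketch only: simulated per-cell average expression for one hypothetical region
+library(mixtools) #provides normalmixEM()
+
+set.seed(42)
+#300 cells without a CNA (centered at 0) and 200 cells with a gain (centered at 1.5)
+region_expr <- c(rnorm(300, mean = 0, sd = 0.3), rnorm(200, mean = 1.5, sd = 0.3))
+n <- length(region_expr)
+
+#1-component model: one Gaussian with maximum-likelihood mean and sd (2 parameters)
+mu1 <- mean(region_expr)
+sd1 <- sqrt(mean((region_expr - mu1)^2))
+loglik1 <- sum(dnorm(region_expr, mean = mu1, sd = sd1, log = TRUE))
+bic1 <- -2 * loglik1 + 2 * log(n)
+
+#2-component model: two means, two sds and one mixing proportion (5 parameters)
+fit2 <- normalmixEM(region_expr, k = 2)
+bic2 <- -2 * fit2$loglik + 5 * log(n)
+
+#BIC difference as defined above; a large positive value favors the 2-component model
+bic_difference <- bic1 - bic2
+bic_difference
+```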
+
+```R
+#Step 1 - Normalize Seurat counts matrix using CONICSmat normMat function
+conicsmat_expr <- CONICSmat::normMat(as.matrix(Seurat::GetAssayData(merged_subset, assay = 'RNA', layer = 'counts')))
+
+#Step 2 - Get chromosomal positions of genes in the expression matrix
+gene_pos=getGenePositions(rownames(conicsmat_expr), ensembl_version = "https://oct2022.archive.ensembl.org/", species = "mouse")
+
+#Step 3 - Filter out uninformative genes, i.e. genes that are expressed in fewer than 5 cells (the default given by CONICSmat)
+conicsmat_expr=filterMatrix(conicsmat_expr,gene_pos[,"mgi_symbol"],minCells=5)
+
+#Step 4 - Calculate normalization factor for each cell - this centers the gene expression in each cell around the mean. (Need this because the more genes that are expressed in a cell, the fewer reads are 'available' per gene)
+normFactor=calcNormFactors(conicsmat_expr)
+
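+#Step 5 - Fit the 2 component Gaussian mixture model for each region to determine if the region has a CNA. 
+## plotAll needs a `regions` table of chromosomal coordinates. The path below is an assumed location for the mm10 positions file from the CONICSmat GitHub; adjust it to wherever the file was saved.
+regions=read.table("data/single_cell_rna/chromosome_full_positions_mm10.txt",sep="\t",row.names=1,header=T)
+## This step outputs a PDF with a page for each region in the regions file that we can look through to determine which regions are likely to have a CNA.
+## It also outputs a BIC_LR.txt file that summarizes the BIC scores and adj p-values for each region
+l=plotAll(conicsmat_expr,normFactor,regions,gene_pos,"outdir_single_cell_rna/conic_plotall")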
+```
+
+Let's look through the `conic_plotall_CNVs.pdf` file. Looking through the results, we can see clear bimodal distributions for chromosomes 2, 11, and 12, and we see a corresponding lower BIC score for the 2 component model in these cases. However, we also see a low BIC score for cases like chromosome 5 even though we don't see a bimodal distribution, and this may contribute to noisy results when we apply a BIC threshold. Looking at these barplots, a BIC threshold of around 800 might help filter out the noise, so let's see the results with the default threshold (200) and 800. 
+
+```R
+#the BIC difference threshold will be applied using the text file that was generated by the previous step. Let's read that in and look at it in the RStudio window
+lrbic <- read.table("outdir_single_cell_rna/conic_plotall_BIC_LR.txt",sep="\t",header=T,row.names=1,check.names=F)
+lrbic
+
+#filter candidate regions using default threshold 200
+candRegions_200 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>200 & lrbic[,"LRT adj. p-val"]<0.01)]
+
+#plot a histogram and heatmap where we try to split it up into 2 clusters (ideally tumor and non tumor cells)
+plotHistogram(l[,candRegions_200],conicsmat_expr,clusters=2,zscoreThreshold=4)
+#note that this command outputs both a plot, and some barcodes with numbers. 
+#The number next to each barcode is the heatmap cluster ID that the barcode is assigned to. But we can ignore that for now.
+#did that split up the cells with the gain/losses well? Remember that based on the lpwgs data, we're expecting gains in chr2 and chr11, and a loss in chr12
+#If not, try it with a different number of clusters (4, 8, 12 etc)
+
+#Now filter candidates using threshold 800
+candRegions_800 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>800 & lrbic[,"LRT adj. p-val"]<0.01)]
+
+#plot a histogram and heatmap where we try to split it up into 2 clusters (ideally tumor and non tumor cells)
+plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=2,zscoreThreshold=4)
+#how does that look? Maybe try increasing the clusters a little to get it to split up nicely.
+
+#Once you have a cluster split you like, save the barcode-cluster ID list to a variable for use later. Also save the plot as a PDF. (Note if the save doesn't work, you may need to run dev.off() a few times until you get a message saying null device 1)
+pdf("outdir/conic_plot_histogram_cand_regions_3clusters.pdf", width=5, height=5)
+hi <- plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4)
+dev.off()
+hi
+#Also note the cluster with putative malignant cells from the heatmap (note that the cluster IDs are not always in order, refer to the text in grey on the right to identify the clusters)
+
+#now convert the hi named vector to a dataframe
+hi_df <- data.frame(hi) 
+#look at hi_df in RStudio
+
+#add a new column for tumor cell status to the barcodes based on the cluster you've determined to be the tumor cells
+hi_df$tumor_cell_classification <- ifelse(hi_df$hi=='2', 'cnv-tumor cell', 'cnv-not tumor cell')
+#look at hi_df dataframe in RStudio again
+
+#double-check our classifications worked as expected
+table(hi_df$hi, hi_df$tumor_cell_classification)
+
+#let's also see where all the cells cluster on the UMAP. 
+#for this we can use the DimPlot function and provide it with a list of cell barcodes that we want to color separately
+DimPlot(merged, cells.highlight = rownames(hi_df[hi_df$tumor_cell_classification == 'cnv-tumor cell',])) + #breakdown the argument given to cells.highlight
+  DimPlot(merged, group.by = 'immgen_singler_main') 
+```
+
+Looks like most of our tumor cells based on the CNV classification are in the epithelial cells cluster as we'd expect! A key point about CONICSmat here is that the tool itself doesn't distinguish a tumor cell from a non-tumor cell. In this case we knew from the low-pass whole genome data that the tumor cells should have gains in chromosomes 2 and 11 and a loss in chromosome 12. But if we did not have that information, we could overlay our celltypes on the heatmap to identify the CNAs, as they should primarily arise in the tumor's celltype.
+
+```R
+# using the celltypes argument we can annotate the heatmap with the celltype of each cell. 
+plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4, celltypes = merged_subset$immgen_singler_main)
+#note that we got the cell barcodes from merged_subset as that was the object used to generate the initial conicsmat_expr matrix.
+```
+
+We will look into adding this information to the Seurat object after we have the SNV tumor calls as well. 
+
+### Finding tumor cells based on mutation data
+
+We will use [VarTrix](https://github.com/10XGenomics/vartrix) to identify cells with mutations in our scRNAseq data. However, VarTrix is run at the command line, so we will not run it in this workshop. Instead, we will go over the command used to run it and how the outputs were processed, and then plot the resulting mutation calls in RStudio. As input, VarTrix takes the scRNAseq BAM and barcodes.tsv files output from CellRanger, along with a VCF file containing the variants, and a reference genome fasta file. It can be run in 3 `--scoring-method` modes; here we chose `coverage`. This mode generates 2 output matrices: the `ref-matrix` and `out-matrix`. The former summarizes the number of REF reads (reads matching wild-type) observed for every cell barcode and variant, while the latter summarizes the same for ALT reads (reads matching mutation). The vartrix command below was run for each replicate separately.
+
+```bash
+vartrix --vcf [input VCF file (unique to each replicate)] \
+ --bam [cellranger bam file (unique to each replicate)] \
+ --fasta [mm10 mouse reference genome] \
+ --cell-barcodes [barcodes.tsv (unique to each replicate)] \
+ --out-matrix [name of output ALT reads matrix] \
+ --ref-matrix [name of output REF reads matrix] \
+ --out-variants [name of output summarizing variants] -s coverage
+```  
+
+The first few lines of one of the output matrices are shown below (both matrices are formatted the same way), where the first column identifies the variant number, the second column identifies the cell barcode, and the third column indicates the number of reads (REF or ALT depending on the matrix) for that variant:
+```
+%%MatrixMarket matrix coordinate real general
+% written by sprs
+16449 4920 292931
+5 199 6
+5 1209 0
+5 2198 0
+5 2673 0
+5 4560 0
+6 117 1
+6 1000 4
+... 
+```
+
+The processing of these matrices to identify variant-containing cells depends on the data. In this case, the mutation calls were noisy, so the authors required a cell to have at least 2 variants with greater than 20X total coverage, with at least 5 ALT reads, and 10% VAF. However, if one has a confident set of tumor-specific variants, the criteria to classify tumor-specific cells can be less stringent. We will not go over the data processing steps here.
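+
+As a rough illustration of the criteria quoted above, the sketch below applies them to small toy matrices rather than real VarTrix output. The matrices, the barcode names, and the choice to apply the coverage, ALT read, and VAF cutoffs to each variant separately are assumptions made for this example; the authors' actual processing may have differed.
+
+```R
+#sketch only: toy REF and ALT matrices with rows = variants and columns = cell barcodes
+#in practice these could be read from the VarTrix outputs with Matrix::readMM()
+library(Matrix)
+
+ref_counts <- Matrix(c(30, 0, 25,
+                       18, 2, 40,
+                       22, 0,  3), nrow = 3, byrow = TRUE, sparse = TRUE)
+alt_counts <- Matrix(c( 6, 0, 0,
+                        9, 1, 0,
+                        1, 0, 5), nrow = 3, byrow = TRUE, sparse = TRUE)
+colnames(ref_counts) <- colnames(alt_counts) <- c("cell_A", "cell_B", "cell_C")
+
+#dense copies are fine for a toy example (real matrices are much larger)
+ref <- as.matrix(ref_counts)
+alt <- as.matrix(alt_counts)
+
+#total coverage and variant allele fraction (VAF) for every variant-cell pair
+total <- ref + alt
+vaf <- ifelse(total > 0, alt / total, 0)
+
+#a variant counts towards a cell if it has >20X coverage, >=5 ALT reads and >=10% VAF
+variant_pass <- total > 20 & alt >= 5 & vaf >= 0.10
+
+#a cell is called a tumor cell if at least 2 variants pass
+n_pass <- colSums(variant_pass)
+tumor_call <- n_pass >= 2
+tumor_call
+```
+
+With a confident set of tumor-specific variants, the same structure could be reused with looser cutoffs, for example by lowering the number of passing variants required per cell.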
+
+
+
+
+
+
+
 
diff --git a/_posts/0009-09-03-POSIT_Setup.md b/_posts/0009-09-03-POSIT_Setup.md
index c756fd9..b2149fa 100644
--- a/_posts/0009-09-03-POSIT_Setup.md
+++ b/_posts/0009-09-03-POSIT_Setup.md
@@ -22,6 +22,7 @@ Folders for uploading raw data were created using the RStudio terminal. Files we
 ```bash
 mkdir data
 mkdir outdir
+mkdir outdir_single_cell_rna
 mkdir package_installation
 
 cd data
@@ -33,8 +34,8 @@ mkdir bulk_rna
 - CellRanger outputs for reps1,3,5 (uploaded from `/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/counts_gex/sample_filtered_feature_bc_matrix.h5.zip`)
 - BCR and TCR clonotypes (uploaded from `/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_b_posit.zip` and `/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_t_posit.zip`)
 - MSigDB `M8: cell type signature gene sets` (downloaded GMT file from [MSigDB website](https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2023.2.Mm/m8.all.v2023.2.Mm.symbols.gmt) to laptop and then uploaded to single_cell_rna folder)
-- InferCNV Gene ordering files (download from TrinityCTAT - [annotation by gene id file](https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_id.infercnv_positions) and [annotation by gene name file](https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_name.infercnv_positions))
-- Vartrix file with barcodes and tumor calls (uploaded from `/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI.tsv`)
+- CONICSmat mm10 chr arms positions file (downloaded file from CONICSmat GitHub - [chromosome_full_positions_mm10.txt](https://github.com/diazlab/CONICS/blob/master/chromosome_full_positions_mm10.txt) to laptop and then uploaded to single_cell_rna folder)
+- Vartrix file with barcodes and tumor calls (uploaded from `/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI_Updated_Barcodes.tsv`)
 
 Posit requires all files to be zipped prior to uploading and automatically unzips the folder after the upload. After uploading the files, made a folder for the cellranger outputs, and moved the `.h5` files there. Will also download inferCNV files using `wget`
 ```bash
@@ -108,11 +109,23 @@ BiocManager::install('HDF5Array')
 BiocManager::install('terra')
 BiocManager::install('ggrastr')
 devtools::install_github('cole-trapnell-lab/monocle3')
+install.packages("beanplot")
+install.packages("mixtools")
+install.packages("pheatmap")
+install.packages("zoo")
+install.packages("squash")
+install.packages("showtext")
+BiocManager::install("biomaRt")
+BiocManager::install("scran")
+devtools::install_github("diazlab/CONICS/CONICSmat", dep = FALSE)
+install.packages("gprofiler2")
+
 
 # Bulk RNA seq libraries
 BiocManager::install("genefilter")
 install.packages("dplyr")
 install.packages("ggplot2")
+install.packages("data.table")
 BiocManager::install("AnnotationDbi")
 BiocManager::install("org.Hs.eg.db")
 BiocManager::install("GO.db")
@@ -124,7 +137,6 @@ install.packages("UpSetR")
 BiocManager::install("DESeq2")
 install.packages('gtable')
 BiocManager::install("apeglm")
-
 ```
 
 
diff --git a/assets/module_8/CNV_scatterplot_fig2_manuscript.png b/assets/module_8/CNV_scatterplot_fig2_manuscript.png
new file mode 100644
index 0000000..7163235
Binary files /dev/null and b/assets/module_8/CNV_scatterplot_fig2_manuscript.png differ