markdown source builds

Auto-generated via {sandpaper} Source : c8c5bf6 Branch : main Author : Andrew Ghazi <[email protected]> Time : 2024-10-04 16:21:42 +0000 Message : Merge pull request #51 from ccb-hms/ms_spec_ex model specification exercise
carpentries-incubator · Oct 4, 2024 · cb39e46 · cb39e46
1 parent a651db9
commit cb39e46
Show file tree

Hide file tree

Showing 5 changed files with 60 additions and 20 deletions.
diff --git a/fig/multi-sample-rendered-unnamed-chunk-24-1.png b/fig/multi-sample-rendered-unnamed-chunk-24-1.png
diff --git a/fig/multi-sample-rendered-unnamed-chunk-25-1.png b/fig/multi-sample-rendered-unnamed-chunk-25-1.png
diff --git a/fig/multi-sample-rendered-unnamed-chunk-25-2.png b/fig/multi-sample-rendered-unnamed-chunk-25-2.png
diff --git a/md5sum.txt b/md5sum.txt
@@ -7,7 +7,7 @@
 "episodes/intro-sce.Rmd" "b704934867b22d804de1e0fa0a9600eb" "site/built/intro-sce.md" "2024-10-02"
 "episodes/eda_qc.Rmd" "d1a9a8a578fff2b0e9566ec613da388d" "site/built/eda_qc.md" "2024-10-03"
 "episodes/cell_type_annotation.Rmd" "8fab6e0cbb60d6fe6a67a0004c5ce5ab" "site/built/cell_type_annotation.md" "2024-10-02"
-"episodes/multi-sample.Rmd" "f21e7e5dc2ad020ee739374905582225" "site/built/multi-sample.md" "2024-10-03"
+"episodes/multi-sample.Rmd" "6d47bd1941b7a83000873da8d836fba0" "site/built/multi-sample.md" "2024-10-04"
 "episodes/large_data.Rmd" "f19fa53e9e63d4cb8fe0f6ab61c8fc3a" "site/built/large_data.md" "2024-10-02"
 "episodes/hca.Rmd" "247c19cda5e903d2eb1a5dd547d95506" "site/built/hca.md" "2024-10-03"
 "instructors/instructor-notes.md" "205339793f625a1844a768dea8e4a9c8" "site/built/instructor-notes.md" "2024-09-24"

diff --git a/multi-sample.md b/multi-sample.md
@@ -903,61 +903,101 @@ loaded via a namespace (and not attached):
 
 ## Exercises
 
-:::::::::::::::::::::::::::::::::: challenge
 
-#### Exercise 1: Replicates
+:::::::::::::::::::::::::::::::::: challenge
 
-Test differential expressed genes with just 2 replicates per condition and look into the changes in the results and possible emerging issues.
+#### Exercise 1: Heatmaps
 
+Use the `pheatmap` package to create a heatmap of the abundances table. Does it comport with the model results?
 
 :::::::::::::: hint
 
-Remember, you can subset SingleCellExperiments with logical indices, just like a matrix. You can also access their column data with the `$` accessor, like a data frame. 
+You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can tweak, but the defaults are usually pretty good.
 
 :::::::::::::::::::::::
 
 :::::::::::::: solution
 
 
-
 ``` r
-summed.filt.subset = summed.filt[,summed.filt$pool != 3]
-
-de.results <- pseudoBulkDGE(
-    summed.filt.subset, 
-    label = summed.filt.subset$celltype.mapped,
-    design = ~factor(pool) + tomato,
-    coef = "tomatoTRUE",
-    condition = summed.filt.subset$tomato 
-)
+pheatmap(y.ab$counts)
 ```
 
+<img src="fig/multi-sample-rendered-unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
+
+The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9. 
+
 :::::::::::::::::::::::
 
 :::::::::::::::::::::::::::::::::::::::::::::
 
 :::::::::::::::::::::::::::::::::: challenge
 
-#### Exercise 2: Heatmaps
+#### Exercise 2: Model specification and comparison
 
-Use the `pheatmap` package to create a heatmap of the abundances table. Does it comport with the model results?
+Try re-running the pseudobulk DGE without the `pool` factor in the design specification. Compare the logFC estimates and the distribution of p-values for the `Erythroid3` cell type.
 
 :::::::::::::: hint
 
-You can just hand `pheatmap()` a matrix as its only argument. It has a million options, but the defaults are usually pretty good.
+After running the second pseudobulk DGE, you can join the two `DataFrame`s of `Erythroid3` statistics using the `merge()` function. You will need to create a common key column from the gene IDs.
 
 :::::::::::::::::::::::
 
 :::::::::::::: solution
 
 
+
 ``` r
-pheatmap(y.ab$counts)
+de.results2 <- pseudoBulkDGE(
+    summed.filt, 
+    label = summed.filt$celltype.mapped,
+    design = ~tomato,
+    coef = "tomatoTRUE",
+    condition = summed.filt$tomato 
+)
+
+eryth1 <- de.results$Erythroid3
+
+eryth2 <- de.results2$Erythroid3
+
+eryth1$gene <- rownames(eryth1)
+
+eryth2$gene <- rownames(eryth2)
+
+comp_df <- merge(eryth1, eryth2, by = 'gene')
+
+comp_df <- comp_df[!is.na(comp_df$logFC.x),]
+
+ggplot(comp_df, aes(logFC.x, logFC.y)) + 
+    geom_abline(lty = 2, color = "grey") +
+    geom_point() 
 ```
 
 <img src="fig/multi-sample-rendered-unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
 
-The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9. 
+``` r
+# Reshape to long format for ggplot facets. This is 1000x times easier to do
+# with tidyverse packages:
+pval_df <- reshape(comp_df[,c("gene", "PValue.x", "PValue.y")],
+                   direction = "long", 
+                   v.names = "Pvalue",
+                   timevar = "pool_factor",
+                   times = c("with pool factor", "no pool factor"),
+                   varying = c("PValue.x", "PValue.y"))
+
+ggplot(pval_df, aes(Pvalue)) + 
+    geom_histogram(boundary = 0,
+                   bins = 30) + 
+    facet_wrap("pool_factor")
+```
+
+<img src="fig/multi-sample-rendered-unnamed-chunk-25-2.png" style="display: block; margin: auto;" />
+
+We can see that in this case, the logFC estimates are strongly consistent between the two models, which tells us that the inclusion of the `pool` factor in the model doesn't strongly influence the estimate of the `tomato` coefficients in this case.
+
+The p-value histograms both look alright here, with a largely flat plateau over most of the 0 - 1 range and a spike near 0. This is consistent with the hypothesis that most genes are unaffected by `tomato` but there are a small handful that clearly are. 
+
+If there were large shifts in the logFC estimates or p-value distributions, that's a sign that the design specification change has a large impact on how the model sees the data. If that happens, you'll need to think carefully and critically about what variables should and should not be included in the model formula.
 
 :::::::::::::::::::::::