markdown source builds
Auto-generated via {sandpaper}
Source  : c8c5bf6
Branch  : main
Author  : Andrew Ghazi <[email protected]>
Time    : 2024-10-04 16:21:42 +0000
Message : Merge pull request #51 from ccb-hms/ms_spec_ex

model specification exercise
actions-user committed Oct 4, 2024
1 parent a651db9 commit cb39e46
Showing 5 changed files with 60 additions and 20 deletions.
Binary file added fig/multi-sample-rendered-unnamed-chunk-24-1.png
Binary file modified fig/multi-sample-rendered-unnamed-chunk-25-1.png
Binary file added fig/multi-sample-rendered-unnamed-chunk-25-2.png
2 changes: 1 addition & 1 deletion md5sum.txt
@@ -7,7 +7,7 @@
"episodes/intro-sce.Rmd" "b704934867b22d804de1e0fa0a9600eb" "site/built/intro-sce.md" "2024-10-02"
"episodes/eda_qc.Rmd" "d1a9a8a578fff2b0e9566ec613da388d" "site/built/eda_qc.md" "2024-10-03"
"episodes/cell_type_annotation.Rmd" "8fab6e0cbb60d6fe6a67a0004c5ce5ab" "site/built/cell_type_annotation.md" "2024-10-02"
"episodes/multi-sample.Rmd" "f21e7e5dc2ad020ee739374905582225" "site/built/multi-sample.md" "2024-10-03"
"episodes/multi-sample.Rmd" "6d47bd1941b7a83000873da8d836fba0" "site/built/multi-sample.md" "2024-10-04"
"episodes/large_data.Rmd" "f19fa53e9e63d4cb8fe0f6ab61c8fc3a" "site/built/large_data.md" "2024-10-02"
"episodes/hca.Rmd" "247c19cda5e903d2eb1a5dd547d95506" "site/built/hca.md" "2024-10-03"
"instructors/instructor-notes.md" "205339793f625a1844a768dea8e4a9c8" "site/built/instructor-notes.md" "2024-09-24"
78 changes: 59 additions & 19 deletions multi-sample.md
@@ -903,61 +903,101 @@ loaded via a namespace (and not attached):

## Exercises

:::::::::::::::::::::::::::::::::: challenge

#### Exercise 1: Replicates
:::::::::::::::::::::::::::::::::: challenge

Test for differentially expressed genes using just 2 replicates per condition, then examine how the results change and what issues emerge.
#### Exercise 1: Heatmaps

Use the `pheatmap` package to create a heatmap of the abundances table. Does it comport with the model results?

:::::::::::::: hint

Remember, you can subset SingleCellExperiments with logical indices, just like a matrix. You can also access their column data with the `$` accessor, like a data frame.
You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can tweak, but the defaults are usually pretty good.

:::::::::::::::::::::::

:::::::::::::: solution



``` r
summed.filt.subset = summed.filt[,summed.filt$pool != 3]

de.results <- pseudoBulkDGE(
  summed.filt.subset,
  label = summed.filt.subset$celltype.mapped,
  design = ~factor(pool) + tomato,
  coef = "tomatoTRUE",
  condition = summed.filt.subset$tomato
)
pheatmap(y.ab$counts)
```

<img src="fig/multi-sample-rendered-unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9.
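
The effect of `log1p()` on what a heatmap can show may be easier to see on a toy example. The sketch below uses a made-up `toy_counts` matrix (its values and row/column names are hypothetical, not the lesson's `y.ab$counts`), and the `pheatmap()` call is guarded so the snippet still runs if the package is not installed.

```r
# Toy abundance table (hypothetical values, NOT the lesson's y.ab$counts):
# rows are cell types, columns are samples.
toy_counts <- matrix(
  c(900, 850, 1000, 950,   # abundant cell type that dominates the colour scale
     10,  12,   40,  45,   # rare cell type with a roughly 4x shift
      5,   6,    5,   7),  # rare cell type with no real shift
  nrow = 3, byrow = TRUE,
  dimnames = list(c("abundant", "rare_shifted", "rare_flat"),
                  c("s1", "s2", "s3", "s4"))
)

# log1p(x) computes log(1 + x), so zero counts would stay finite.
log_counts <- log1p(toy_counts)

# The spread between the largest and smallest entries shrinks a lot on the
# log scale, so the rare rows get a visible share of the colour range.
raw_ratio <- max(toy_counts) / min(toy_counts)
log_ratio <- max(log_counts) / min(log_counts)

# Draw the heatmap only if pheatmap is available.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(log_counts)
}
```

On the raw scale the abundant row uses up nearly all of the colour range; after the transformation, the shift in the `rare_shifted` row becomes much easier to spot.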

:::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::: challenge

#### Exercise 2: Heatmaps
#### Exercise 2: Model specification and comparison

Use the `pheatmap` package to create a heatmap of the abundances table. Does it comport with the model results?
Try re-running the pseudobulk DGE without the `pool` factor in the design specification. Compare the logFC estimates and the distribution of p-values for the `Erythroid3` cell type.

:::::::::::::: hint

You can just hand `pheatmap()` a matrix as its only argument. It has a million options, but the defaults are usually pretty good.
After running the second pseudobulk DGE, you can join the two `DataFrame`s of `Erythroid3` statistics using the `merge()` function. You will need to create a common key column from the gene IDs.

:::::::::::::::::::::::

:::::::::::::: solution



``` r
pheatmap(y.ab$counts)
de.results2 <- pseudoBulkDGE(
  summed.filt,
  label = summed.filt$celltype.mapped,
  design = ~tomato,
  coef = "tomatoTRUE",
  condition = summed.filt$tomato
)

eryth1 <- de.results$Erythroid3

eryth2 <- de.results2$Erythroid3

eryth1$gene <- rownames(eryth1)

eryth2$gene <- rownames(eryth2)

comp_df <- merge(eryth1, eryth2, by = 'gene')

comp_df <- comp_df[!is.na(comp_df$logFC.x),]

ggplot(comp_df, aes(logFC.x, logFC.y)) +
  geom_abline(lty = 2, color = "grey") +
  geom_point()
```

<img src="fig/multi-sample-rendered-unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9.
``` r
# Reshape to long format for ggplot facets. This is much easier to do
# with tidyverse packages:
pval_df <- reshape(comp_df[,c("gene", "PValue.x", "PValue.y")],
                   direction = "long",
                   v.names = "Pvalue",
                   timevar = "pool_factor",
                   times = c("with pool factor", "no pool factor"),
                   varying = c("PValue.x", "PValue.y"))

ggplot(pval_df, aes(Pvalue)) +
  geom_histogram(boundary = 0,
                 bins = 30) +
  facet_wrap("pool_factor")
```

<img src="fig/multi-sample-rendered-unnamed-chunk-25-2.png" style="display: block; margin: auto;" />

We can see that the logFC estimates are strongly consistent between the two models, which tells us that including the `pool` factor in the model doesn't strongly influence the estimates of the `tomato` coefficient in this case.

The p-value histograms both look alright here, with a largely flat plateau over most of the 0 - 1 range and a spike near 0. This is consistent with the hypothesis that most genes are unaffected by `tomato` but there are a small handful that clearly are.

Large shifts in the logFC estimates or p-value distributions would be a sign that the design specification change has a large impact on how the model sees the data. If that happens, you'll need to think carefully and critically about which variables should and should not be included in the model formula.
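
If you want numbers to go with the plots, one possible approach is sketched below. It runs on simulated statistics rather than the lesson's `comp_df`: every value in `sim_df` is made up, the 0.5 and 0.05 thresholds are arbitrary choices, and only the column names (`logFC.x`, `logFC.y`, `PValue.x`, `PValue.y`) mirror what `merge()` produces above.

```r
set.seed(1)
n_genes <- 500

# Simulated stand-in for comp_df (hypothetical values, for illustration only).
logFC_true <- rnorm(n_genes)
sim_df <- data.frame(
  gene     = paste0("gene", seq_len(n_genes)),
  logFC.x  = logFC_true + rnorm(n_genes, sd = 0.1),  # "with pool factor"
  logFC.y  = logFC_true + rnorm(n_genes, sd = 0.1),  # "no pool factor"
  PValue.x = runif(n_genes),
  PValue.y = runif(n_genes)
)

# Agreement between the two sets of logFC estimates. "complete.obs" drops
# genes with an NA in either column, which real DGE output may contain.
logfc_cor <- cor(sim_df$logFC.x, sim_df$logFC.y, use = "complete.obs")

# Genes whose estimate moves by more than an arbitrary threshold of 0.5.
shifted <- sim_df$gene[which(abs(sim_df$logFC.x - sim_df$logFC.y) > 0.5)]

# Fraction of genes with p < 0.05 under each model.
frac_small <- colMeans(sim_df[, c("PValue.x", "PValue.y")] < 0.05,
                       na.rm = TRUE)
```

A correlation close to 1 and an empty or short `shifted` vector would match the "strongly consistent" picture described above, while a large change in `frac_small` between the two columns would suggest that the design change matters.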

:::::::::::::::::::::::
