---
title: "AD Knowledge Portal Workshop: Differential Expression Analysis of 5xFAD mouse models"
date: "`r Sys.Date()`"
author:
- Laura Heath & Jaclyn Beck (Sage Bionetworks)
- Adapted from code written by Ravi Pandey (Jackson Laboratories)
format:
  html:
    toc: true
    toc-depth: 3
    df-print: paged
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
    encoding=encoding,
    output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
---

This notebook takes the raw counts matrix and metadata files we downloaded in the first part of the workshop (`5XFAD_data_R_tutorial.qmd`) and runs a basic differential expression analysis on a single time point (12 months) in male mice. You can amend the code to compare wild type and 5XFAD mice of either sex at any time point. For a more in-depth tutorial on DESeq2 and how to handle more complicated experimental setups, see the [DESeq2 vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

The data used in this notebook come from the Jax.IU.Pitt_5XFAD study, which can be found [here on the AD Knowledge Portal](https://adknowledgeportal.synapse.org/Explore/Studies/DetailsPage/StudyData?Study=syn21983020).

------------------------------------------------------------------------
## Setup
```{r, set-opts, include=FALSE}
knitr::opts_chunk$set(
eval = TRUE,
print.rows = 10
)
```
### Install and load packages
We will need several new packages from Bioconductor to run this analysis:
```{r install-packages, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "EnhancedVolcano"))
```
If not already installed, be sure to install the `synapser`, `tidyverse`, and `lubridate` packages from part 1 of this workshop.
Load necessary libraries.
```{r load-libraries, message=FALSE, warning=TRUE}
library(DESeq2)
library(ggplot2)
library(EnhancedVolcano)
library(dplyr)
library(synapser)
library(readr)
library(tibble)
library(lubridate)
```
------------------------------------------------------------------------
## Download counts and metadata from Synapse
The code below is a (more condensed) repeat of the code from Part 1 of the workshop (`5XFAD_data_R_tutorial.qmd`) that fetches the counts file and metadata files. If you have just run that notebook and still have `counts` and `covars` in your environment, you likely do not need to re-run the code below and can skip to [Modify the data for analysis].
First, we log in to Synapse:
```{r synlogin_run, include = FALSE}
# This executes the code without showing the printed welcome statements. The
# next block will show the code but not run it.
synLogin()
```
```{r synlogin, eval=FALSE}
synLogin()
```
Then, we fetch the counts and metadata files. As mentioned in part 1, it is good practice to assign the output of `synGet` to a variable and use `variable$path` to reference the file name, as done below. For this part of the workshop, we will skip the bulk download step for metadata files and instead download each file by ID.
```{r download-data, results="hide"}
# counts
counts_id <- "syn22108847"
counts_file <- synGet(counts_id,
                      downloadLocation = "files/",
                      ifcollision = "overwrite.local")
counts <- read_tsv(counts_file$path, show_col_types = FALSE)

# individual metadata
ind_metaID <- "syn22103212"
ind_file <- synGet(ind_metaID,
                   downloadLocation = "files/",
                   ifcollision = "overwrite.local")
ind_meta <- read_csv(ind_file$path, show_col_types = FALSE)

# biospecimen metadata
bio_metaID <- "syn22103213"
bio_file <- synGet(bio_metaID,
                   downloadLocation = "files/",
                   ifcollision = "overwrite.local")
bio_meta <- read_csv(bio_file$path, show_col_types = FALSE)

# RNA assay metadata
rna_metaID <- "syn22110328"
rna_file <- synGet(rna_metaID,
                   downloadLocation = "files/",
                   ifcollision = "overwrite.local")
rna_meta <- read_csv(rna_file$path, show_col_types = FALSE)
```
### Join metadata files together
Join the three metadata files by IDs in common so we can associate the column names of `counts` (which are specimenIDs) with individual mice from the individual metadata file.
```{r join-metadata}
joined_meta <- rna_meta |> # start with the rnaseq assay metadata
  # join rows from biospecimen that match specimenID
  left_join(bio_meta, by = "specimenID") |>
  # join rows from individual that match individualID
  left_join(ind_meta, by = "individualID")
```
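
Before relying on a joined table, it can help to confirm that the join keys behave as expected. The sketch below uses small made-up ID vectors (`toy_rna_ids` and `toy_bio_ids` are hypothetical and are not taken from the real metadata) to show two base R checks: whether any ID on the left side has no match on the right, and whether the right side contains duplicate keys.

```{r sketch-join-check}
# Toy IDs only; these are NOT taken from the real metadata files
toy_rna_ids <- c("s1", "s2", "s3")
toy_bio_ids <- c("s1", "s2", "s3", "s4")

# IDs in the left table with no match in the right table would end up with NA
# values in the joined columns
unmatched <- setdiff(toy_rna_ids, toy_bio_ids)
length(unmatched) == 0

# Duplicate keys in the right table would create extra rows after a left join
anyDuplicated(toy_bio_ids) == 0
```

The same two checks could be run on `rna_meta$specimenID` against `bio_meta$specimenID`, and on `individualID` for the second join.
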
Create a timepoint variable (months since birth) from the `dateBirth` and `dateDeath` fields in the metadata.
```{r create-age-death}
# parse month/day/year strings into Date objects using lubridate
joined_meta_time <- joined_meta |>
  mutate(dateBirth = mdy(dateBirth),
         dateDeath = mdy(dateDeath)) |>
  # create a new column that subtracts dateBirth from dateDeath in days, then
  # divide by 30 to get months
  mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth,
                                         units = "days")) / 30) |>
  # convert numeric ages to timepoint categories
  mutate(timepoint = case_when(timepoint > 10 ~ "12 mo",
                               timepoint < 10 & timepoint > 5 ~ "6 mo",
                               timepoint < 5 ~ "4 mo"))

# check that the timepoint column looks ok (should be 6 mice in each group)
joined_meta_time |>
  group_by(sex, genotype, timepoint) |>
  count()
```
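
To make the binning step concrete, here is a minimal base R sketch on three made-up birth and death dates (the `toy_*` objects are hypothetical). It computes age in days, divides by 30 to approximate months, and bins the result with `cut()` using the same 5 and 10 month boundaries as the `case_when()` call above.

```{r sketch-timepoint-bins}
# Hypothetical birth/death dates, written month/day/year like the metadata
toy_birth <- as.Date(c("01/01/2018", "01/01/2018", "01/01/2018"),
                     format = "%m/%d/%Y")
toy_death <- as.Date(c("05/01/2018", "07/02/2018", "12/27/2018"),
                     format = "%m/%d/%Y")

# Age in days divided by 30 gives an approximate age in months
toy_months <- as.numeric(difftime(toy_death, toy_birth, units = "days")) / 30

# Bin the ages at 5 and 10 months. Note: values exactly equal to 5 or 10 fall
# into the lower bin here, whereas the case_when() call would leave them as NA
toy_bins <- cut(toy_months,
                breaks = c(-Inf, 5, 10, Inf),
                labels = c("4 mo", "6 mo", "12 mo"))

data.frame(toy_months = round(toy_months, 1), toy_bins)
```
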
Select the covariates needed for the analysis
```{r select-covars}
covars <- joined_meta_time |>
dplyr::select(individualID, specimenID, sex, genotype, timepoint)
covars
```
Utility function that maps Ensembl IDs to gene symbols (copied from Part 1). It reads `ensembl_translation_key.csv`, so that file needs to be in the working directory.
```{r map-function}
# Assumes that the rownames of "df" are the Ensembl IDs
map_ensembl_ids <- function(df) {
  ensembl_to_gene <- read.csv(file = "ensembl_translation_key.csv")

  mapped_df <- df |>
    # Make a gene_id column that matches the ensembl_to_gene table
    rownames_to_column("gene_id") |>
    dplyr::left_join(ensembl_to_gene, by = "gene_id") |>
    relocate(gene_name, .after = gene_id)

  # The first two genes in the matrix are the humanized genes PSEN1
  # (ENSG00000080815) and APP (ENSG00000142192). Set these manually:
  mapped_df[1, "gene_name"] <- "PSEN1"
  mapped_df[2, "gene_name"] <- "APP"

  return(mapped_df)
}
```
**End of repeated code from part 1.**
## Modify the data for analysis
Clean up the `covars` data: coerce covars into a dataframe, label the rows by specimenID, and check the result
```{r covars-cleanup}
covars <- as.data.frame(covars)
rownames(covars) <- covars$specimenID
covars
```
Order the data (counts columns and metadata rows MUST be in the same order), and subset the counts matrix and metadata to include only 12 month old male mice
```{r subset-12m-male}
meta.12M.Male <- covars |>
  subset(sex == "male" & timepoint == "12 mo")

# Subsets counts to only the 12 month male samples, and puts the samples in the
# same order they appear in meta.12M.Male
counts.12M.Male <- counts |>
  # Set the rownames to the gene ID, remove "gene_id" column
  column_to_rownames("gene_id") |>
  # Only use columns that appear in meta.12M.Male; all_of() makes it explicit
  # that we are selecting by an external character vector of column names
  dplyr::select(all_of(meta.12M.Male$specimenID))
```
This leaves us with 12 samples, 6 per genotype:
```{r check-subset}
meta.12M.Male |>
group_by(sex, genotype, timepoint) |>
dplyr::count()
```
Verify that the columns in `counts.12M.Male` are in the same order as the specimenIDs in `meta.12M.Male`:
```{r verify-colnames}
all(colnames(counts.12M.Male) == meta.12M.Male$specimenID)
```
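
If a check like this ever returns `FALSE`, the columns can be reordered by indexing with the metadata IDs. The sketch below shows the idea on made-up objects (`toy_meta` and `toy_counts` are hypothetical) and uses `stopifnot()` so that a mismatch halts the script instead of passing silently.

```{r sketch-reorder-columns}
# Toy objects; names and values are made up for illustration
toy_meta <- data.frame(specimenID = c("s1", "s2", "s3"),
                       genotype = c("wt", "wt", "tg"),
                       stringsAsFactors = FALSE)
toy_counts <- data.frame(s3 = c(3, 4), s1 = c(1, 2), s2 = c(5, 6),
                         row.names = c("gene1", "gene2"))

# Columns are out of order relative to the metadata rows
identical(colnames(toy_counts), toy_meta$specimenID)

# Reorder columns by indexing with the metadata IDs, then stop if anything
# still disagrees
toy_counts <- toy_counts[, toy_meta$specimenID]
stopifnot(identical(colnames(toy_counts), toy_meta$specimenID))
toy_counts
```
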
We should now have:

1. A data.frame of metadata for 12-month-old male mice, one row per specimen
2. A table of counts (a data.frame with gene IDs as row names) where each row is a gene and each column is a single specimen

We can now analyze this data with DESeq2.
------------------------------------------------------------------------
## Differential Expression Analysis using DESeq2
Set up the data for analysis. All samples are from 12-month-old male mice, so we are only interested in the effect of genotype. We specify this for DESeq2 by setting the `design` argument to `~ genotype`, which tells DESeq2 to model expression as a function of genotype (`expression ~ genotype`) when estimating the genotype effect. For this to work properly, `genotype` must be a factor so that DESeq2 treats it as a categorical variable.
```{r make-deseq2-obj, message=FALSE}
meta.12M.Male$genotype <- factor(meta.12M.Male$genotype)
ddsHTSeq <- DESeqDataSetFromMatrix(countData = counts.12M.Male,
                                   colData = meta.12M.Male,
                                   design = ~ genotype)
```
*Note on R formula syntax:* A formula written as `~ variable` is automatically expanded to the linear equation `expression ~ (beta1 * 1) + (beta2 * variable)`, where 1 represents the intercept and `beta1` and `beta2` are the coefficients to be estimated.
For more complicated formula setup, refer to the [DESeq2 vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
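
One way to see this expansion is with base R's `model.matrix()`, which turns a formula and a factor into a numeric design matrix. The sketch below uses a made-up factor (`toy_genotype` is hypothetical) with the same level names as this dataset. It is only an illustration of how a formula expands, not a view into what DESeq2 does internally.

```{r sketch-design-matrix}
# Toy genotype labels using the same level names as this dataset
toy_genotype <- factor(c("5XFAD_noncarrier", "5XFAD_noncarrier",
                         "5XFAD_carrier", "5XFAD_carrier"))

# Factor levels are alphabetical by default, so the first level (the
# reference) is 5XFAD_carrier unless we change it
levels(toy_genotype)

# Make the noncarrier group the reference level
toy_genotype <- relevel(toy_genotype, ref = "5XFAD_noncarrier")

# One column for the intercept (all 1s) and one 0/1 indicator for carriers
toy_design <- model.matrix(~ toy_genotype)
toy_design
```

The intercept column corresponds to `beta1` (the reference group) and the indicator column to `beta2` (the difference between carriers and the reference group).
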
**Back to analysis...**
Filter out genes that have all zero counts across all samples. You can also use more stringent criteria like only keeping genes that have at least *X* counts in at least *Y* samples, but for this workshop we will just remove zero-genes.
```{r filter-genes}
paste("Total genes before filtering:", nrow(ddsHTSeq))
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) >= 1, ]
paste("Total genes after filtering:", nrow(ddsHTSeq))
```
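
For reference, a stricter filter of the "at least *X* counts in at least *Y* samples" kind can be sketched as follows. The matrix and both thresholds (`min_count`, `min_samples`) are made-up values for illustration, not recommendations.

```{r sketch-count-filter}
# Toy count matrix: 4 genes x 4 samples (made-up values)
toy_counts_mat <- matrix(c( 0,  0,  0,  0,
                            1,  0,  0,  0,
                           12, 15,  9, 20,
                           30,  2, 11,  0),
                         nrow = 4, byrow = TRUE,
                         dimnames = list(paste0("gene", 1:4),
                                         paste0("s", 1:4)))

# Hypothetical thresholds: at least min_count reads in at least min_samples
# samples
min_count <- 10
min_samples <- 2

# For each gene, count the samples that reach min_count, then compare that
# number to min_samples
keep <- rowSums(toy_counts_mat >= min_count) >= min_samples
keep
toy_counts_mat[keep, , drop = FALSE]
```

On the real object, the same logical vector could be built from `counts(ddsHTSeq)` and used to subset `ddsHTSeq`, in the same way as the zero-count filter above.
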
Set wild-type mice (5XFAD_noncarrier) as the reference genotype, so the comparison is 5XFAD_carrier - 5XFAD_noncarrier.
```{r relevel-genotype}
ddsHTSeq$genotype <- relevel(ddsHTSeq$genotype, ref = "5XFAD_noncarrier")
```
### Run DESeq
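This function normalizes the read counts, estimates dispersions, and fits a negative binomial generalized linear model using the formula we specified in `design` above (`~ genotype`). Setting `parallel = TRUE` runs the computation in parallel through the BiocParallel package.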
```{r run-deseq2, results = "hide", message=FALSE}
dds <- DESeq(ddsHTSeq, parallel = TRUE)
```
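
To give some intuition for the normalization step, here is a simplified base R sketch of the median-of-ratios idea that DESeq2 uses by default to estimate size factors. This is not DESeq2's own code: the toy matrix is made up, and the sketch skips details such as handling genes with zero counts.

```{r sketch-size-factors}
# Toy counts: sample s2 was sequenced twice as deeply as s1 (made-up values)
toy_depth <- matrix(c( 10,  20,
                      100, 200,
                       50, 100),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# Per-gene geometric mean across samples, computed on the log scale
log_geo_means <- rowMeans(log(toy_depth))

# Size factor per sample: median over genes of (count / geometric mean)
toy_size_factors <- apply(toy_depth, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})
toy_size_factors

# Dividing each column by its size factor puts samples on a common scale
toy_normalized <- sweep(toy_depth, 2, toy_size_factors, "/")
toy_normalized
```

After dividing by the size factors, the two toy samples have identical values, because the only difference between them was sequencing depth. The size factors that DESeq2 estimated for the real data can be inspected with `sizeFactors(dds)`.
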
### Extract a table of results
The significance threshold can be set using the `alpha` argument of the `results` function. Here we use 0.05.
```{r get-results}
res <- results(dds, alpha = 0.05)
summary(res)
head(as.data.frame(res))
```
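
The `padj` column holds p-values adjusted for multiple testing. By default, `results` uses the Benjamini-Hochberg method for this. The sketch below applies the same method to a handful of made-up p-values with base R's `p.adjust()`. DESeq2 also applies independent filtering before adjustment, which can set some `padj` values to `NA`, so this sketch will not reproduce its numbers exactly.

```{r sketch-bh-adjust}
# Made-up raw p-values for five genes
toy_pvalues <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment with base R
toy_padj <- p.adjust(toy_pvalues, method = "BH")
data.frame(pvalue = toy_pvalues, padj = toy_padj)

# Genes passing a 0.05 cutoff before and after adjustment
sum(toy_pvalues < 0.05)
sum(toy_padj < 0.05)
```

Fewer genes pass the cutoff after adjustment, which is why we filter on `padj` instead of `pvalue` in the rest of this notebook.
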
Add gene symbols to the results
```{r add-gene-symbols}
res <- map_ensembl_ids(as.data.frame(res))
```
What are some of the top up-regulated genes?
```{r upreg-genes}
res |>
  subset(padj < 0.05) |>
  slice_max(order_by = log2FoldChange, n = 10) |>
  select(gene_id, gene_name, log2FoldChange, padj)
```
What are some of the top down-regulated genes?
```{r downreg-genes}
res |>
  subset(padj < 0.05) |>
  slice_min(order_by = log2FoldChange, n = 10) |>
  select(gene_id, gene_name, log2FoldChange, padj)
```
------------------------------------------------------------------------
## Plot results
Volcano plot of the differential expression results, highlighting genes with adjusted p \< 0.05 and absolute log2FC \> 0.5:
```{r volcano-plot, warning=FALSE}
plot_DEGvolcano <- EnhancedVolcano(res,
                                   lab = res$gene_name,
                                   x = 'log2FoldChange',
                                   y = 'padj',
                                   legendPosition = 'none',
                                   title = 'DE Results of 12 mo. old Male Mice',
                                   subtitle = '',
                                   FCcutoff = 0.5,
                                   pCutoff = 0.05,
                                   xlim = c(-3, 17),
                                   pointSize = 1,
                                   labSize = 4)
plot_DEGvolcano
plot_DEGvolcano
```
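
To tally how many genes fall on each side of these cutoffs, the same thresholds can be applied directly to a results table. The sketch below does this on a made-up data frame (`toy_res` is hypothetical, with column names that mirror the DESeq2 results). It is a stand-alone illustration of the thresholds, not EnhancedVolcano's internal logic.

```{r sketch-volcano-classes}
# Made-up results for five genes; column names mirror the DESeq2 results table
toy_res <- data.frame(
  gene_name = c("g1", "g2", "g3", "g4", "g5"),
  log2FoldChange = c(2.1, -1.3, 0.2, 0.9, -0.1),
  padj = c(0.001, 0.01, 0.0001, 0.2, NA)
)

fc_cutoff <- 0.5
p_cutoff <- 0.05

# Does the gene pass both cutoffs? NA padj values count as not passing
passes <- !is.na(toy_res$padj) &
  toy_res$padj < p_cutoff &
  abs(toy_res$log2FoldChange) > fc_cutoff

# Label passing genes by the sign of their fold change
toy_res$direction <- ifelse(!passes, "not significant",
                            ifelse(toy_res$log2FoldChange > 0, "up", "down"))
table(toy_res$direction)
```
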
Save results table and plot:
```{r save-results, results = "hide"}
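write.csv(res, file="5XFAD_DEresults_12mo_males.csv", row.names=FALSE)
# Pass the plot explicitly so the saved file does not depend on which plot was
# displayed last
ggsave("VolcanoPlot.png", plot = plot_DEGvolcano, width = 8, height = 6, units = "in")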
```
------------------------------------------------------------------------
<details>
<summary>R Package Info</summary>
```{r session-info}
sessionInfo()
```
</details>