diff --git a/_posts/0003-03-02-Differential_Expression-edgeR.md b/_posts/0003-03-02-Differential_Expression-edgeR.md index 816bd64..ee01eab 100644 --- a/_posts/0003-03-02-Differential_Expression-edgeR.md +++ b/_posts/0003-03-02-Differential_Expression-edgeR.md @@ -23,7 +23,7 @@ If you would like a brief refresher on differential expression analysis, please ### edgeR DE Analysis In this tutorial you will: -* Make use of the raw counts you generated above using htseq-count +* Make use of the raw counts you generated previously using htseq-count * edgeR is a bioconductor package designed specifically for differential expression of count-based RNA-seq data * This is an alternative to using stringtie/ballgown to find differentially expressed genes diff --git a/_posts/0003-03-03-Differential_Expression-DESeq2.md b/_posts/0003-03-03-Differential_Expression-DESeq2.md index 53cdd40..36df1a1 100644 --- a/_posts/0003-03-03-Differential_Expression-DESeq2.md +++ b/_posts/0003-03-03-Differential_Expression-DESeq2.md @@ -23,13 +23,13 @@ If you would like a brief refresher on differential expression analysis, please ### DESeq2 DE Analysis In this tutorial you will: -* Make use of the raw counts you generated above using htseq-count +* Make use of the raw counts you generated previously using htseq-count * DESeq2 is a bioconductor package designed specifically for differential expression of count-based RNA-seq data * This is an alternative to using stringtie/ballgown to find differentially expressed genes ### Setup -Here we launch R, install relevant packages (if needed), set working directories and read in data. Two pieces of information are required to perform analysis with DESeq2. A matrix of raw counts, such as was generated previously while running [HTseq](https://htseq.readthedocs.io/en/release_0.9.0/) previously in this course. This is important as DESeq2 normalizes the data, correcting for differences in library size using using this type of data. 
DESeq2 also requires the experimental design which can be supplied as a data.frame, detailing the samples and conditions. +Here we launch R, install relevant packages (if needed), set working directories and read in the raw read counts data. Two pieces of information are required to perform analysis with DESeq2. The first is a matrix of raw counts, such as was generated with [HTseq](https://htseq.readthedocs.io/en/release_0.9.0/) previously in this course. This is important as DESeq2 normalizes the data, correcting for differences in library size, using this type of data. DESeq2 also requires the experimental design, which can be supplied as a data.frame detailing the samples and conditions. Launch R: @@ -38,7 +38,7 @@ R ``` ```R -# # Install the latest version of DEseq2 +# Install the latest version of DESeq2 # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("DESeq2", version = "3.8") @@ -59,7 +59,9 @@ htseqCounts <- fread("gene_read_counts_table_all_final.tsv") ``` ### Format htseq counts data to work with DESeq2 -DESeq2 has a number of options to start with and actually has a function to read direct HTseq output files. Here the most universal option is used, reading in raw counts from a matrix. The HTseq count data that was read in above is an object of class data.table, this can be verified with the `class()` function, so it is required to convert to an appropriate matrix with gene names as rows and samples as columns. It should be noted that while the replicate samples are technical replicates (i.e. the same library was sequenced), herein they are treated as biological replicates for illustrative purposes. DESeq2 does have a function to collapse technical replicates though. +DESeq2 has a number of options for data import and actually has a function to read HTseq output files directly. 
Here the most universal option is used, reading in raw counts from a matrix in simple TSV format (one row per gene, one column per sample). The HTseq count data that was read in above is stored as an object of class data.table; this can be verified with the `class()` function. For this exercise, the object must be converted to an appropriate matrix format with gene names as rows and samples as columns. + +It should be noted that while the replicate samples are technical replicates (i.e. the same library was sequenced), herein they are treated as biological replicates for illustrative purposes. DESeq2 does have a function to collapse technical replicates though. ```R @@ -76,39 +78,42 @@ rownames(htseqCounts) <- htseqCounts[,"GeneID"] # now that the gene IDs are the row names, remove the redundant column that contains them htseqCounts <- htseqCounts[, colnames(htseqCounts) != "GeneID"] -# convert the actual count values from strings (with spaces) to integers, because originally the gene column contained characters the entire matrix was set to character +# convert the actual count values from strings (with spaces) to integers; because the gene column originally contained characters, the entire matrix was set to character class(htseqCounts) <- "integer" # view the first few lines of the gene count matrix head(htseqCounts) -# it can also be usefull to view interactively (if in Rstudio) +# it can also be useful to view interactively (if in RStudio) view(htseqCounts) ``` ### Filter raw counts -Before running DESeq2 or any differential expression analysis it is usefull to pre-filter data. There are computational benefits to doing this as the memory size of the objects within R will decrease and DESeq2 will have less data to work through and will be faster. By removing "low quality" data it is also avoids running multiple test correction on genes which are not relevant. 
The amount of pre-filtering is up to the analyst however it is not desireable to do too much, DESeq2 recomments removing any gene with less than 10 reads across samples. Below we filter a gene if at least 1 sample does not have at least 10 reads. +Before running DESeq2 or any differential expression analysis it is useful to pre-filter data. There are computational benefits to doing this as the memory size of the objects within R will decrease and DESeq2 will have less data to work through and will be faster. Removing "low quality" data also reduces the number of statistical tests performed, which in turn reduces the impact of multiple test correction and can lead to more significant genes. + +The amount of pre-filtering is up to the analyst; however, it should be done in an unbiased way. DESeq2 recommends removing any gene with fewer than 10 reads across all samples. Below, we remove a gene unless at least 1 sample has at least 10 reads. Either way, mostly what is being removed here are genes with very little evidence for expression in any sample (in many cases 0 counts in all samples). ```R # run a filtering step # i.e. require that for every gene: at least 1 of 6 samples must have counts greater than 10 # get index of rows that meet this criterion and use that to subset the matrix # note the dimensions of the matrix before and after filtering with dim + dim(htseqCounts) htseqCounts <- htseqCounts[which(rowSums(htseqCounts >= 10) >=1),] dim(htseqCounts) -# Hint! if you find the above command confusing piece it apart +# Hint! if you find the above command confusing, break it into pieces and observe the result # -# what does rowSums(htseqCounts >= 10) do? +# what does "rowSums(htseqCounts >= 10)" do? # -# what does rowSums(htseqCounts >= 10) >=1 do? +# what does "rowSums(htseqCounts >= 10) >=1" do? 
``` ### Specifying the experimental design -As mentioned above DESeq2 also needs to know the experimental design, that is which samples belong to which condition to test. The experimental design for the example dataset herein is quite simple as there are 6 samples with one condition to test, as such we can just create the experimental design right within R. There is one important thing to note, DESeq2 does not check sample names, it expects that the column names in the matrix we created correspond to the row names in the experimental design. +As mentioned above, DESeq2 also needs to know the experimental design, that is, which samples belong to which condition to test. The experimental design for the example dataset herein is quite simple as there are 6 samples with one condition to test, so we can just create the experimental design right within R. There is one important thing to note: DESeq2 does not check sample names; it expects that the column names in the counts matrix we created correspond exactly to the row names we specify in the experimental design. ```R # construct a mapping of the meta data for our experiment (comparing UHR cell lines to HBR brain tissues) @@ -119,13 +124,13 @@ metaData <- data.frame('Condition'=c('UHR', 'UHR', 'UHR', 'HBR', 'HBR', 'HBR')) # convert the "Condition" column to a factor data type, this will determine the direction of log2 fold-changes for the genes (i.e. 
up or down regulated) metaData$Condition <- factor(metaData$Condition, levels=c('HBR', 'UHR')) -# set the row names of the dataframe to be the names of our sample replicates +# set the row names of the metaData dataframe to be the names of our sample replicates from the read counts matrix rownames(metaData) <- colnames(htseqCounts) # view the metadata dataframe head(metaData) -# check that htseq count cols match meta data rows +# check that names of htseq count columns match the names of the meta data rows # use the "all" function which tests whether an entire logical vector is TRUE all(rownames(metaData) == colnames(htseqCounts)) ``` @@ -137,8 +142,9 @@ With all the data properly formatted it is now possible to combine all the infor # make deseq2 data sets # here we are setting up our experiment by supplying: (1) the gene counts matrix, (2) the sample/replicate for each column, and (3) the biological conditions we wish to compare. # this is a simple example that works for many experiments but these can also get more complex -# for example, including designs with multiple variables, e.g., ~ group + condition, -# and designs with interactions, e.g., ~ genotype + treatment + genotype:treatment. +# for example, including designs with multiple variables such as "~ group + condition", +# and designs with interactions such as "~ genotype + treatment + genotype:treatment". + dds <- DESeqDataSetFromMatrix(countData = htseqCounts, colData = metaData, design = ~Condition) ``` @@ -146,7 +152,9 @@ dds <- DESeqDataSetFromMatrix(countData = htseqCounts, colData = metaData, desig With all the data now in place DESeq2 can be run. 
Calling DESeq2 will perform the following actions: - Estimation of size factors - Estimation of dispersion -- Negative Binomial GLM fitting and Wald statistic +- Independent filtering to reduce the number of statistical tests performed (see ?results and https://doi.org/10.1073/pnas.0914005107 for details) +- Negative Binomial GLM fitting and the Wald statistical test +- Correction of p-values for multiple testing using the Benjamini-Hochberg method ```R # run the DESeq2 analysis on the "dds" object @@ -166,7 +174,7 @@ It is good practice to shrink the log-fold change values, this does exactly what # In simplistic terms, the goal of calculating "dispersion estimates" and "shrinkage" is also to account for the problem that # genes with low mean counts across replicates tend of have higher variability than those with higher mean counts. -# Shrinkage attempts to correct for this. For a detailed discussion of shrinkage refer to the DESeq2 vignette +# Shrinkage attempts to correct for this. For a more detailed discussion of shrinkage refer to the DESeq2 vignette # first get the name of the coefficient (log fold change) to shrink resultsNames(dds) @@ -183,7 +191,7 @@ head(deGeneResult) ``` ### Annotate gene symbols onto the DE results -DESeq2 was run with ensembl gene id's as identifiers, this is not the most human friendly way to interpret results. Here gene symbols are merged onto the differential expressed gene list to make results a bit more interpretable. +DESeq2 was run with Ensembl gene IDs as identifiers, which is not the most human-friendly way to interpret results. Here gene symbols are merged onto the differentially expressed gene list to make the results a bit more interpretable. 
```R # read in gene ID to name mappings (using "fread" an alternative to "read.table") @@ -211,7 +219,7 @@ head(deGeneResult) ``` ### Data manipulation -With the DE analysis complete it is usefull to view and filter the data frames to only the relevant genes, here some basic data manipulation is performed filtering to significant genes at specific thresholds. +With the DE analysis complete it is useful to view and filter the data frames to only the relevant genes. Here some basic data manipulation is performed, filtering to significant genes at specific thresholds. ```R # view the top genes according to adjusted p-value @@ -220,19 +228,19 @@ deGeneResult[order(deGeneResult$padj),] # view the top genes according to fold change deGeneResult[order(deGeneResult$log2FoldChange),] -# determine the number of up/down significant genes at FDR = 0.05 significance level +# determine the number of up/down significant genes at the FDR < 0.05 significance level dim(deGeneResult) # number of genes tested dim(deGeneResult[deGeneResult$padj < 0.05]) #number of significant genes # order the DE results by adjusted p-value deGeneResultSorted = deGeneResult[order(deGeneResult$padj),] -# create a filtered data frame the limits to only significantly DE genes +# create a filtered data frame that limits to only the significant DE genes (adjusted p-value < 0.05) deGeneResultSignificant = deGeneResultSorted[deGeneResultSorted$padj < 0.05] ``` -### write out results -The data generated is now written out as tab separated files. Some of the DESeq2 objects are also saved as serialized R objects which can be read back into R later for visualization. +### Save results to files +The data generated is now written out as tab-separated files. Some of the DESeq2 objects are also saved as serialized R (RDS) objects which can be read back into R later for visualization. 
```R # set the working directory to the output dir where we will store any results files @@ -249,7 +257,7 @@ saveRDS(dds, 'dds.rds') saveRDS(res, 'res.rds') saveRDS(resLFC, 'resLFC.rds') -#To exit R type the following +# to exit R type the following #quit(save="no") ```
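The "Hint!" in the filtering step of the DESeq2 post suggests breaking `htseqCounts[which(rowSums(htseqCounts >= 10) >=1),]` into pieces. As a minimal sketch, the same expression can be stepped through on a small toy matrix. The gene names, sample names and counts below are made up for illustration and are not part of the course data; only base R is used.

```R
# toy counts matrix (hypothetical values): 4 genes (rows) x 6 samples (columns)
toyCounts <- matrix(
  c( 0,  0,  0,  0,  0,  0,   # geneA: no reads in any sample
     3,  9,  2,  5,  1,  4,   # geneB: some reads, but never 10 or more
    12,  0,  0,  0,  0,  0,   # geneC: exactly one sample with 10 or more
    50, 60, 55, 10, 12, 11),  # geneD: every sample with 10 or more
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), paste0("sample", 1:6))
)

# piece 1: a logical matrix, TRUE wherever a count is at least 10
passes <- toyCounts >= 10

# piece 2: TRUE counts as 1, so rowSums gives the number of passing samples per gene
nPassing <- rowSums(passes)

# piece 3: a logical vector, TRUE for genes with at least one passing sample
keep <- nPassing >= 1

# subset the matrix to the kept genes (drop = FALSE keeps it a matrix even if one row remains)
toyFiltered <- toyCounts[which(keep), , drop = FALSE]

# compare dimensions before and after, as in the tutorial
dim(toyCounts)
dim(toyFiltered)
```

In this toy case only geneC and geneD are retained: a gene is kept as soon as a single sample reaches 10 reads, and removed only when no sample does.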