document new features; enhance report; add versions to all environments
sreichl committed Apr 6, 2023
1 parent edade34 commit b7de53d
Showing 9 changed files with 64 additions and 27 deletions.
29 changes: 22 additions & 7 deletions README.md
@@ -1,6 +1,8 @@
# Differential Analysis & Visualization Snakemake Workflow Using LIMMA
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for performing and visualizing differential expression (or accessibility) analyses (DEA) of NGS data (eg RNA-seq, ATAC-seq, scRNA-seq,...) powered by the R package [limma](https://www.bioconductor.org/packages/release/bioc/html/limma.html).

This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.

**If you use this workflow in a publication, don't forget to give credits to the authors by citing the URL of this (original) repository (and its DOI, see Zenodo badge above -> coming soon).**

![Workflow Rulegraph](./workflow/dags/rulegraph.svg)
@@ -16,6 +18,7 @@ Table of contents
* [Examples](#examples)
* [Links](#links)
* [Resources](#resources)
* [Publications](#publications)

# Authors
- [Stephan Reichl](https://github.com/sreichl)
@@ -38,14 +41,13 @@ This project wouldn't be possible without the following software and their depen
# Methods
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in square brackets, e.g., [X].

__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), absolute log2 fold change (> [X]), and average gene expression (> [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.
__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), effect size (absolute log2 fold change > [X]), and expression (average expression > [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.

__Visualization.__ The filtered result statistics, i.e., number of relevant features split by positive (up) and negative (down) effect sizes, were visualized with stacked bar plots using ggplot (ver) [ref].
To visually summarize results of all performed comparisons, the filtered effect size (log2 fold change) values of all features that were found to be relevant in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref].
To visually summarize results of all performed comparisons, the effect size (log2 fold change) values of all relevant features in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref].
Volcano plots were generated for each comparison using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [pCutoff] and log2 fold change threshold of [FCcutoff] as visual cut-offs for the y- and x-axis, respectively.
Finally, quality control plots of the fitted mean-variance relationship and raw p-values of the features were generated.


**The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow (ver) [ref - cite this workflow here].**

# Features
@@ -69,14 +71,20 @@ The workflow performs the following steps that produce the outlined results:
- DEA result filtering of features (eg genes) by
  - statistical significance (<= adjusted p-value: adj_pval)
  - effect size (>= absolute log 2 fold change: lfc)
  - average expression (>= ave_expr) in the data
  - average expression (>= ave_expr) in the data (to skip this filter use `-Inf`)
- Log Fold Change (LFC) matrix of filtered features by comparison groups (CSV).
  - (optional) annotated LFC matrix with suffix "_annot" (CSV)
- Visualizations
  - filtered DEA result statistics ie number of features and direction (stacked bar plots)
  - volcano plot per comparison with configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff)
  - clustered heatmap of the LFC matrix
  - quality control plots
  - volcano plots per comparison with effect size on the x-axis and raw p-value (rawp)/adjusted p-value (adjp) on the y-axis
    - highlighting features according to configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff)
    - (optional) highlighting features according to configured feature lists
  - hierarchically clustered heatmap of effect sizes (LFC) per comparison (features x comparisons) indicating statistical significance with a star '\*'
    - using all relevant features (FILTERED)
    - (optional) using configured feature lists
    - in case of more than 100 features the row labels and significance indicators (\*) are removed
    - in case of more than 50000 features no heatmap is generated
  - diagnostic quality control plots
    - (optional) voom mean-variance trend
    - (optional) intermediate mean-variance trend, in case of blocking and vooming
    - post-fitting mean-variance trend
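
The filtering criteria and the heatmap size limits listed above can be illustrated with a minimal Python sketch. It is not the workflow's implementation (which runs in R); the helper names `filter_features` and `heatmap_mode`, the example rows, and the default threshold values are assumptions for illustration. The column names follow limma's `topTable` output, and the comparisons are inclusive as in the feature list above.

```python
import math


def filter_features(rows, adj_pval=0.05, lfc=1.0, ave_expr=-math.inf):
    """Keep rows passing all three thresholds (inclusive comparisons).

    ave_expr=-inf disables the expression filter, mirroring `-Inf` in the config.
    """
    return [
        row
        for row in rows
        if row["adj.P.Val"] <= adj_pval
        and abs(row["logFC"]) >= lfc
        and row["AveExpr"] >= ave_expr
    ]


def heatmap_mode(n_features):
    """Describe how the LFC heatmap is drawn for a given number of features."""
    if n_features > 50000:
        return "skipped"    # no heatmap is generated
    if n_features > 100:
        return "unlabeled"  # row labels and significance stars are removed
    return "labeled"


# Hypothetical DEA results for one comparison.
results = [
    {"feature": "geneA", "logFC": 2.3, "AveExpr": 5.1, "adj.P.Val": 0.001},
    {"feature": "geneB", "logFC": -1.7, "AveExpr": -0.4, "adj.P.Val": 0.01},
    {"feature": "geneC", "logFC": 0.2, "AveExpr": 8.0, "adj.P.Val": 0.0001},
    {"feature": "geneD", "logFC": 3.0, "AveExpr": 2.0, "adj.P.Val": 0.2},
]

kept_all = [r["feature"] for r in filter_features(results)]
kept_expressed = [r["feature"] for r in filter_features(results, ave_expr=0.0)]
```

With the default `ave_expr` of negative infinity, only significance and effect size decide which features are kept; setting a finite value additionally drops lowly expressed features such as `geneB`.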
@@ -107,6 +115,9 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
- [Snakemake Workflow Catalog Entry](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/dea_limma)

# Resources
- Recommended [MR.PARETO](https://github.com/epigen/mr.pareto) modules for downstream analyses:
  - [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of results.
  - [Genome Tracks](https://github.com/epigen/genome_tracks) for visualization of top hits.
- [Bioconductor - limma](http://bioconductor.org/packages/release/bioc/html/limma.html) includes a 150-page user guide
- [R Manual on Model Formulae](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html)
- [Bioconductor - RNAseq123 - Workflow](https://bioconductor.org/packages/release/workflows/html/RNAseq123.html)
@@ -126,3 +137,7 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
- alternative/complementary DEA method: Linear Mixed Models (LMM)
  - [variancePartition](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1323-z)
  - [dream](https://academic.oup.com/bioinformatics/article/37/2/192/5878955)

# Publications
The following publications successfully used this module for their analyses.
- ...
4 changes: 2 additions & 2 deletions config/config.yaml
@@ -56,8 +56,8 @@ volcano:
FCcutoff: 2

# path(s) to feature lists as plain text files (.txt) with one gene per line.
# if feature_annotation is provided then the provided feature names are expected, otherwise the features from the input data frame are used
# used to plot highlight in volcano plots and generate LFC clustered heatmaps
# if feature_annotation is provided then the provided feature names are expected, otherwise the features from the input data frame are used.
# used to highlight features in volcano plots and generate LFC clustered heatmaps
# only use camelCase for the feature_list names like in the examples below.
# if not used leave empty.
feature_lists:
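
The feature list convention described in the comments above (plain text, one gene per line, empty entries ignored) can be sketched in Python as follows. This is an illustrative sketch, not the workflow's code: `read_feature_list`, the list names `myGenes` and `unusedList`, and the gene symbols are assumptions. The filter on empty paths mirrors the dictionary comprehension in `workflow/Snakefile`.

```python
import os
import tempfile


def read_feature_list(path):
    """Read a plain text feature list with one feature per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small example list to a temporary file (hypothetical gene symbols).
tmp_dir = tempfile.mkdtemp()
list_path = os.path.join(tmp_dir, "myGenes.txt")
with open(list_path, "w") as handle:
    handle.write("STAT1\nIRF1\n\nCXCL10\n")

# Hypothetical `feature_lists` section of the config; unused entries are left empty.
feature_lists_config = {"myGenes": list_path, "unusedList": ""}

# Drop entries without a path, as the Snakefile does.
active_lists = {name: path for name, path in feature_lists_config.items() if path != ""}
features = {name: read_feature_list(path) for name, path in active_lists.items()}
```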
4 changes: 3 additions & 1 deletion workflow/Snakefile
@@ -11,6 +11,8 @@ min_version("6.0.3")
SDIR = os.path.realpath(os.path.dirname(srcdir("Snakefile")))
shell.prefix(f"set -eo pipefail;")

module_name = "dea_limma"

##### container image #####
# containerized: "docker://sreichl/..."

@@ -34,7 +36,7 @@ feature_lists_dict = {k: v for k, v in feature_lists_dict.items() if v!=""}
if feature_lists_dict is not None:
    feature_lists = feature_lists + list(feature_lists_dict.keys())

result_path = os.path.join(config["result_path"],'dea_limma')
result_path = os.path.join(config["result_path"], module_name)
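
The effect of introducing `module_name` can be shown with a small standalone sketch: the result directory and the report category are both derived from one variable instead of the repeated literal `'dea_limma'`. The `config` values below are hypothetical placeholders, not taken from the repository.

```python
import os

module_name = "dea_limma"

# Hypothetical config values for illustration only.
config = {"result_path": "results", "project_name": "MyProject"}

# Result directory, built as in the changed Snakefile line.
result_path = os.path.join(config["result_path"], module_name)

# Report category, built as in the changed rule files.
category = "{}_{}".format(config["project_name"], module_name)
```

Renaming the module then only requires changing `module_name` in one place.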

rule all:
input:
4 changes: 2 additions & 2 deletions workflow/envs/ggplot.yaml
@@ -2,5 +2,5 @@ channels:
- conda-forge
- defaults
dependencies:
- r-ggplot2
- r-patchwork
- r-ggplot2=3.3.6
- r-patchwork=1.1.2
4 changes: 2 additions & 2 deletions workflow/envs/limma.yaml
@@ -4,5 +4,5 @@ channels:
- defaults
dependencies:
- bioconductor-limma=3.46.0
- bioconductor-edger
- r-statmod
- bioconductor-edger=3.32.1
- r-statmod=1.4.37
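
A quick way to check that every dependency of an environment is pinned is sketched below. The helper `unpinned` is hypothetical and not part of the workflow. The two lists reproduce the `limma.yaml` dependencies before and after this commit.

```python
def unpinned(dependencies):
    """Return the dependencies that carry no version pin (no '=' in the spec)."""
    return [dep for dep in dependencies if "=" not in dep]


# Dependencies of workflow/envs/limma.yaml before and after this commit.
before = ["bioconductor-limma=3.46.0", "bioconductor-edger", "r-statmod"]
after = ["bioconductor-limma=3.46.0", "bioconductor-edger=3.32.1", "r-statmod=1.4.37"]

missing_before = unpinned(before)
missing_after = unpinned(after)
```

An empty result means all packages are pinned, so the versions needed for the Methods template can be read directly from the environment file.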
2 changes: 1 addition & 1 deletion workflow/envs/volcanos.yaml
@@ -4,4 +4,4 @@ channels:
- defaults
dependencies:
- bioconductor-enhancedvolcano=1.12.0
- r-patchwork
- r-patchwork=1.1.2
18 changes: 14 additions & 4 deletions workflow/rules/dea.smk
@@ -36,13 +36,23 @@ rule aggregate:
# filtered_features = os.path.join(result_path,'{analysis}','feature_lists','FILTERED_features.txt'),
dea_stats = report(os.path.join(result_path,'{analysis}','DEA_stats.csv'),
caption="../report/dea_stats.rst",
category="{}_dea_limma".format(config["project_name"]),
subcategory="{analysis}"),
category="{}_{}".format(config["project_name"], module_name),
subcategory="{analysis}",
labels={
"name": "DEA statistics",
"type": "table",
"misc": "CSV",
}),
# dea_lfc = os.path.join(result_path,'{analysis}','DEA_LFC.csv'),
dea_stats_plot = report(os.path.join(result_path,'{analysis}','plots','DEA_stats.png'),
caption="../report/dea_stats.rst",
category="{}_dea_limma".format(config["project_name"]),
subcategory="{analysis}"),
category="{}_{}".format(config["project_name"], module_name),
subcategory="{analysis}",
labels={
"name": "DEA statistics",
"type": "stacked bar plot",
"misc": "PNG",
}),
resources:
mem_mb=config.get("mem", "16000"),
threads: config.get("threads", 1)
8 changes: 4 additions & 4 deletions workflow/rules/envs_export.smk
@@ -4,7 +4,7 @@ rule env_export:
report(os.path.join(config["result_path"],'envs','dea_limma','{env}.yaml'),
caption="../report/software.rst",
category="Software",
subcategory="{}_dea_limma".format(config["project_name"])
subcategory="{}_{}".format(config["project_name"], module_name)
),
conda:
"../envs/{env}.yaml"
@@ -26,7 +26,7 @@ rule config_export:
configs = report(os.path.join(config["result_path"],'configs','dea_limma','{}_config.yaml'.format(config["project_name"])),
caption="../report/configs.rst",
category="Configuration",
subcategory="{}_dea_limma".format(config["project_name"])
subcategory="{}_{}".format(config["project_name"], module_name)
)
resources:
mem_mb=config.get("mem", "16000"),
@@ -47,7 +47,7 @@ rule annot_export:
annot = report(os.path.join(config["result_path"],'configs','dea_limma','{}_annot.csv'.format(config["project_name"])),
caption="../report/configs.rst",
category="Configuration",
subcategory="{}_dea_limma".format(config["project_name"])
subcategory="{}_{}".format(config["project_name"], module_name)
)
resources:
mem_mb=1000, #config.get("mem_small", "16000"),
@@ -69,7 +69,7 @@ rule feature_list_export:
feature_lists = report(os.path.join(config["result_path"],'configs','dea_limma','{feature_list}.txt'),
caption="../report/feature_lists.rst",
category="Configuration",
subcategory="{}_dea_limma".format(config["project_name"])
subcategory="{}_{}".format(config["project_name"], module_name)
),
resources:
mem_mb=1000, #config.get("mem_small", "16000"),config.get("mem", "16000"),
18 changes: 14 additions & 4 deletions workflow/rules/visualize.smk
@@ -6,8 +6,13 @@ rule volcanos:
output:
dea_volcanos = report(os.path.join(result_path,'{analysis}','plots','DEA_volcanos_{feature_list}_{pval_type}.png'),
caption="../report/volcano.rst",
category="{}_dea_limma".format(config["project_name"]),
subcategory="{analysis}"),
category="{}_{}".format(config["project_name"], module_name),
subcategory="{analysis}",
labels={
"name": "Volcano plot",
"type": "{pval_type}",
"misc": "{feature_list}",
}),
resources:
mem_mb=config.get("mem", "16000"),
threads: config.get("threads", 1)
@@ -30,8 +35,13 @@ rule lfc_heatmap:
output:
dea_lfc_heatmap = report(os.path.join(result_path,'{analysis}','plots','DEA_LFC_heatmap_{feature_list}.png'),
caption="../report/lfc_heatmap.rst",
category="{}_dea_limma".format(config["project_name"]),
subcategory="{analysis}"),
category="{}_{}".format(config["project_name"], module_name),
subcategory="{analysis}",
labels={
"name": "Heatmap",
"type": "effect sizes",
"misc": "{feature_list}",
}),
resources:
mem_mb=config.get("mem", "16000"),
threads: config.get("threads", 1)
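
The `labels` added to the `report(...)` calls contain wildcard placeholders such as `{pval_type}` and `{feature_list}`, which Snakemake fills in per output file. The sketch below imitates that substitution with plain `str.format` as a stand-in. It does not call Snakemake; `resolve_labels` and the wildcard values are assumptions for illustration.

```python
def resolve_labels(labels, wildcards):
    """Fill '{name}' placeholders in label values with the given wildcard values."""
    return {key: value.format(**wildcards) for key, value in labels.items()}


# Labels as declared for the volcano plot output in this commit.
volcano_labels = {
    "name": "Volcano plot",
    "type": "{pval_type}",
    "misc": "{feature_list}",
}

# Hypothetical wildcard values for one concrete output file.
wildcards = {"analysis": "exampleAnalysis", "pval_type": "adjp", "feature_list": "FILTERED"}

resolved = resolve_labels(volcano_labels, wildcards)
```

Each generated plot therefore gets its own label combination, which lets the report distinguish, for example, raw from adjusted p-value volcano plots.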