document new features; enhance report; add versions to all environments

epigen · Apr 6, 2023 · b7de53d
1 parent edade34
commit b7de53d

Showing 9 changed files with 64 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,8 @@
 # Differential Analysis & Visualization Snakemake Workflow Using LIMMA
 A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for performing and visualizing differential expression (or accessibility) analyses (DEA) of NGS data (eg RNA-seq, ATAC-seq, scRNA-seq,...) powered by the R package [limma](https://www.bioconductor.org/packages/release/bioc/html/limma.html).
 
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.
+
 **If you use this workflow in a publication, don't forget to give credits to the authors by citing the URL of this (original) repository (and its DOI, see Zenodo badge above -> coming soon).**
 
 ![Workflow Rulegraph](./workflow/dags/rulegraph.svg)
@@ -16,6 +18,7 @@ Table of contents
   * [Examples](#examples)
   * [Links](#links)
   * [Resources](#resources)
+  * [Publications](#publications)
 
 # Authors
 - [Stephan Reichl](https://github.com/sreichl)
@@ -38,14 +41,13 @@ This project wouldn't be possible without the following software and their depen
 # Methods
 This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g. [X].
 
-__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), absolute log2 fold change (> [X]), and average gene expression (> [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.
+__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), effect size (absolute log2 fold change > [X]), and expression (average expression > [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.
 
 __Visualization.__ The filtered result statistics, i.e., number of relevant features split by positive (up) and negative (down) effect sizes, were visualized with stacked bar plots using ggplot (ver) [ref].
-To visually summarize results of all performed comparisons, the filtered effect size (log2 fold change) values of all features that were found to be relevant in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref]. 
+To visually summarize results of all performed comparisons, the effect size (log2 fold change) values of all features that were relevant in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref]. 
 Volcano plots were generated for each comparison using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [pCutoff] and log2 fold change threshold of [FCcutoff] as visual cut-offs for the y- and x-axis, respectively.
 Finally, quality control plots of the fitted mean-variance relationship and raw p-values of the features were generated.
 
-
 **The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow (ver) [ref - cite this workflow here].**
 
 # Features
@@ -69,14 +71,20 @@ The workflow performs the following steps that produce the outlined results:
 - DEA result filtering of features (eg genes) by 
   - statistical significance (<= adjusted p-value: adj_pval)
   - effect size (>= absolute log 2 fold change: lfc)
-  - average expression (>= ave_expr) in the data
+  - average expression (>= ave_expr) in the data (to skip this filter use `-Inf`)
 - Log Fold Change (LFC) matrix of filtered features by comparison groups (CSV).
   - (optional) annotated LFC matrix with suffix "_annot" (CSV)
 - Visualizations
   - filtered DEA result statistics ie number of features and direction (stacked bar plots)
-  - volanco plot per comparison with configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff)
-  - clustered heatmap of the LFC matrix
-  - quality control plots
+  - volcano plots per comparison with effect size on the x-axis and raw p-value (rawp)/adjusted p-value (adjp) on the y-axis
+      - highlighting features according to configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff)
+      - (optional) highlighting features according to configured feature lists
+  - hierarchically clustered heatmap of effect sizes (LFC) per comparison (features x comparisons) indicating statistical significance with a star '\*'
+      - using all relevant features (FILTERED)
+      - (optional) using configured feature lists
+      - in case of more than 100 features the row labels and significance indicators (\*) are removed
+      - in case of more than 50000 features no heatmap is generated
+  - diagnostic quality control plots
       - (optional) voom mean-variance trend
       - (optional) intermediate mean-variance trend, in case of blocking and vooming
       - post-fitting mean-variance trend
@@ -107,6 +115,9 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
 - [Snakemake Workflow Catalog Entry](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/dea_limma)
 
 # Resources
+- Recommended [MR.PARETO](https://github.com/epigen/mr.pareto) modules for downstream analyses:
+    - [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of results.
+    - [Genome Tracks](https://github.com/epigen/genome_tracks) for visualization of top hits.
 - [Bioconductor - limma](http://bioconductor.org/packages/release/bioc/html/limma.html) includes a 150-page user guide
 - [R Manual on Model Formulae](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html)
 - [Bioconductor - RNAseq123 - Workflow](https://bioconductor.org/packages/release/workflows/html/RNAseq123.html)
@@ -126,3 +137,7 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
 - alternative/complementary DEA method: Linear Mixed Models (LMM)
     - [variancePartition](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1323-z)
     - [dream](https://academic.oup.com/bioinformatics/article/37/2/192/5878955)
+
+# Publications
+The following publications successfully used this module for their analyses.
+- ...
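
Note on the filtering criteria documented in the README hunk above: a minimal Python sketch of the three filters (adjusted p-value, absolute log2 fold change, average expression), including the `-Inf` setting that skips the expression filter. The workflow itself runs in R, so this is only an illustration. The column names `logFC`, `AveExpr` and `adj.P.Val` are the ones limma's topTable returns and are assumed to carry over to the workflow's result tables; the function name, default thresholds and example rows are hypothetical.

```python
import math


def is_relevant(row, adj_pval=0.05, lfc=1.0, ave_expr=-math.inf):
    """Return True if a DEA result row passes all three filters.

    The default ave_expr of -inf lets every feature pass the
    expression filter, mirroring the README's `-Inf` setting.
    """
    return (
        row["adj.P.Val"] <= adj_pval
        and abs(row["logFC"]) >= lfc
        and row["AveExpr"] >= ave_expr
    )


# Hypothetical result rows, for illustration only.
results = [
    {"feature": "geneA", "logFC": 2.3, "AveExpr": 5.1, "adj.P.Val": 0.001},
    {"feature": "geneB", "logFC": -1.8, "AveExpr": 0.2, "adj.P.Val": 0.01},
    {"feature": "geneC", "logFC": 0.4, "AveExpr": 7.0, "adj.P.Val": 0.20},
]

# Expression filter skipped (ave_expr = -inf).
relevant = [r["feature"] for r in results if is_relevant(r)]

# Expression filter active.
relevant_expressed = [r["feature"] for r in results if is_relevant(r, ave_expr=1.0)]
```

With the expression filter skipped, `geneA` and `geneB` pass; with `ave_expr=1.0` only `geneA` remains, because `geneB` has an average expression of 0.2.
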
diff --git a/config/config.yaml b/config/config.yaml
@@ -56,8 +56,8 @@ volcano:
     FCcutoff: 2
 
 # path(s) to feature lists as plain text files (.txt) with one gene per line.
-# if feature_annotation is provided then the provided feature names are expected, otherwise the features from the input data frame are used
-# used to plot highlight in volcano plots and generate LFC clustered heatmaps
+# if feature_annotation is provided then the provided feature names are expected, otherwise the features from the input data frame are used.
+# used to highlight features in volcano plots and generate LFC clustered heatmaps
 # only use camelCase for the feature_list names like in the examples below.
 # if not used leave empty.
 feature_lists:
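
Note on the feature list format described in the comments above (plain text, one gene per line): a minimal sketch of how such a file could be read. This is not the workflow's own parsing code, which is not part of this diff; the helper `read_feature_list` and the file name `myGenes.txt` are hypothetical.

```python
from pathlib import Path
import tempfile


def read_feature_list(path):
    """Read one feature name per line, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstration with a temporary file (hypothetical camelCase list name).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "myGenes.txt"
    demo.write_text("GeneA\nGeneB\n\nGeneC\n")
    features = read_feature_list(demo)
```
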

diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -11,6 +11,8 @@ min_version("6.0.3")
 SDIR = os.path.realpath(os.path.dirname(srcdir("Snakefile")))
 shell.prefix(f"set -eo pipefail;")
 
+module_name = "dea_limma"
+
 ##### container image #####
 # containerized: "docker://sreichl/..."
 
@@ -34,7 +36,7 @@ feature_lists_dict = {k: v for k, v in feature_lists_dict.items() if v!=""}
 if feature_lists_dict is not None:
     feature_lists = feature_lists + list(feature_lists_dict.keys())
 
-result_path = os.path.join(config["result_path"],'dea_limma')
+result_path = os.path.join(config["result_path"], module_name)
 
 rule all:
     input:
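
Note on `module_name`: the variable introduced above replaces the hard-coded `'dea_limma'` string in the result path and, in the rule files below, in the report categories. A standalone sketch of the two expressions, assuming hypothetical config values (`results`, `myProject`):

```python
import os

module_name = "dea_limma"

# Hypothetical config values, for illustration only.
config = {"result_path": "results", "project_name": "myProject"}

# Result directory, as built in the Snakefile.
result_path = os.path.join(config["result_path"], module_name)

# Report category, as built in the rules.
category = "{}_{}".format(config["project_name"], module_name)
```

Here `category` evaluates to `myProject_dea_limma`, the same string the previous `"{}_dea_limma".format(...)` calls produced, so report grouping should be unchanged as long as `module_name` stays `"dea_limma"`.
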

diff --git a/workflow/envs/ggplot.yaml b/workflow/envs/ggplot.yaml
@@ -2,5 +2,5 @@ channels:
   - conda-forge
   - defaults
 dependencies:
-  - r-ggplot2
-  - r-patchwork
+  - r-ggplot2=3.3.6
+  - r-patchwork=1.1.2
diff --git a/workflow/envs/limma.yaml b/workflow/envs/limma.yaml
@@ -4,5 +4,5 @@ channels:
   - defaults
 dependencies:
   - bioconductor-limma=3.46.0
-  - bioconductor-edger
-  - r-statmod
+  - bioconductor-edger=3.32.1
+  - r-statmod=1.4.37
diff --git a/workflow/envs/volcanos.yaml b/workflow/envs/volcanos.yaml
@@ -4,4 +4,4 @@ channels:
   - defaults
 dependencies:
   - bioconductor-enhancedvolcano=1.12.0
-  - r-patchwork
+  - r-patchwork=1.1.2
diff --git a/workflow/rules/dea.smk b/workflow/rules/dea.smk
@@ -36,13 +36,23 @@ rule aggregate:
         # filtered_features = os.path.join(result_path,'{analysis}','feature_lists','FILTERED_features.txt'),
         dea_stats = report(os.path.join(result_path,'{analysis}','DEA_stats.csv'), 
                                   caption="../report/dea_stats.rst", 
-                                  category="{}_dea_limma".format(config["project_name"]), 
-                                  subcategory="{analysis}"),
+                                  category="{}_{}".format(config["project_name"], module_name),
+                                  subcategory="{analysis}",
+                                  labels={
+                                      "name": "DEA statistics",
+                                      "type": "table",
+                                      "misc": "CSV",
+                                  }),
         # dea_lfc = os.path.join(result_path,'{analysis}','DEA_LFC.csv'),
         dea_stats_plot = report(os.path.join(result_path,'{analysis}','plots','DEA_stats.png'), 
                                   caption="../report/dea_stats.rst", 
-                                  category="{}_dea_limma".format(config["project_name"]), 
-                                  subcategory="{analysis}"),
+                                  category="{}_{}".format(config["project_name"], module_name),
+                                  subcategory="{analysis}",
+                                  labels={
+                                      "name": "DEA statistics",
+                                      "type": "stacked bar plot",
+                                      "misc": "PNG",
+                                  }),
     resources:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)

diff --git a/workflow/rules/envs_export.smk b/workflow/rules/envs_export.smk
@@ -4,7 +4,7 @@ rule env_export:
         report(os.path.join(config["result_path"],'envs','dea_limma','{env}.yaml'),
                       caption="../report/software.rst", 
                       category="Software", 
-                      subcategory="{}_dea_limma".format(config["project_name"])
+                      subcategory="{}_{}".format(config["project_name"], module_name)
                      ),
     conda:
         "../envs/{env}.yaml"
@@ -26,7 +26,7 @@ rule config_export:
         configs = report(os.path.join(config["result_path"],'configs','dea_limma','{}_config.yaml'.format(config["project_name"])), 
                          caption="../report/configs.rst", 
                          category="Configuration", 
-                         subcategory="{}_dea_limma".format(config["project_name"])
+                         subcategory="{}_{}".format(config["project_name"], module_name)
                         )
     resources:
         mem_mb=config.get("mem", "16000"),
@@ -47,7 +47,7 @@ rule annot_export:
         annot = report(os.path.join(config["result_path"],'configs','dea_limma','{}_annot.csv'.format(config["project_name"])), 
                          caption="../report/configs.rst", 
                          category="Configuration", 
-                         subcategory="{}_dea_limma".format(config["project_name"])
+                         subcategory="{}_{}".format(config["project_name"], module_name)
                         )
     resources:
         mem_mb=1000, #config.get("mem_small", "16000"),
@@ -69,7 +69,7 @@ rule feature_list_export:
         feature_lists = report(os.path.join(config["result_path"],'configs','dea_limma','{feature_list}.txt'), 
                             caption="../report/feature_lists.rst", 
                             category="Configuration", 
-                            subcategory="{}_dea_limma".format(config["project_name"])
+                            subcategory="{}_{}".format(config["project_name"], module_name)
                            ),
     resources:
         mem_mb=1000, #config.get("mem_small", "16000"),config.get("mem", "16000"),

diff --git a/workflow/rules/visualize.smk b/workflow/rules/visualize.smk
@@ -6,8 +6,13 @@ rule volcanos:
     output:
         dea_volcanos = report(os.path.join(result_path,'{analysis}','plots','DEA_volcanos_{feature_list}_{pval_type}.png'),
                               caption="../report/volcano.rst",
-                              category="{}_dea_limma".format(config["project_name"]),
-                              subcategory="{analysis}"),
+                              category="{}_{}".format(config["project_name"], module_name),
+                              subcategory="{analysis}",
+                              labels={
+                                  "name": "Volcano plot",
+                                  "type": "{pval_type}",
+                                  "misc": "{feature_list}",
+                              }),
     resources:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)
@@ -30,8 +35,13 @@ rule lfc_heatmap:
     output:
         dea_lfc_heatmap = report(os.path.join(result_path,'{analysis}','plots','DEA_LFC_heatmap_{feature_list}.png'),
                               caption="../report/lfc_heatmap.rst",
-                              category="{}_dea_limma".format(config["project_name"]),
-                              subcategory="{analysis}"),
+                              category="{}_{}".format(config["project_name"], module_name),
+                              subcategory="{analysis}",
+                              labels={
+                                  "name": "Heatmap",
+                                  "type": "effect sizes",
+                                  "misc": "{feature_list}",
+                              }),
     resources:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)
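
Note on the heatmap limits stated in the README hunk (more than 100 features: row labels and significance stars are removed; more than 50000 features: no heatmap is generated): a minimal sketch of that decision logic. The plotting script itself is not part of this diff, so the function name, parameter names and return values are hypothetical.

```python
def heatmap_mode(n_features, max_labeled=100, max_features=50000):
    """Decide how to draw the LFC heatmap for a given number of features.

    Returns "skip" (no heatmap), "unlabeled" (no row labels or
    significance stars) or "labeled" (full annotation).
    """
    if n_features > max_features:
        return "skip"
    if n_features > max_labeled:
        return "unlabeled"
    return "labeled"


modes = [heatmap_mode(n) for n in (100, 101, 50000, 50001)]
```

Both limits are treated as strict "more than" comparisons, following the README wording, so exactly 100 features are still labeled and exactly 50000 features are still plotted.
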