Merge pull request #47 from Ferlab-Ste-Justine/feat/CLIN-3508-improve…

…-output-file-documentation feat: CLIN-3508 improve output files documentation
Ferlab-Ste-Justine · Dec 9, 2024 · 8fea002
2 parents e0667e4 + b278620
commit 8fea002
Showing 2 changed files with 123 additions and 17 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#44](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/44) Decouple the interval file parameter from the broad
 - [#45](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/45) Allow to add dbsnp ids to output vcf files
 - [#46](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/46) Allow to skip the exclude mnp step
+- [#47](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/47) Improve pipeline output documentation
 
 ### `Known issues`
 - The nf-core modules that we are using have a potential performance flaw. Typically, the regex used to describe the output files also match the input files (ex: "*.vcf"), which can cause unnecessary file transfers.  This has already proven to cause issues on fusion. One fix could be to transfer the whole modules to local to perform the small change necessary to fix this.

diff --git a/docs/output.md b/docs/output.md
@@ -3,28 +3,133 @@
 ## Introduction
 
 This document describes the output produced by the pipeline.
-The directories listed below will be created in the output directory after the pipeline has finished. All paths are relative to the top-level output directory.
+The directories described below will be created in the output directory after the pipeline has finished. All paths are relative to the top-level output directory.
 
+## Overview
 
-## Pipeline overview
+The pipeline output is saved step-by-step in the output directory as each step is completed. Below, we provide a description of the output folders corresponding to the main steps, as well as the `pipeline_info` folder, which contains details about the submitted job.
 
-### Pipeline information
+- [Directory Structure](#directory-structure)
+- [Pipeline Information: pipeline_info](#pipeline-information-pipeline_info)
+- [Normalization Step: splitmultiallelics](#normalization-step-splitmultiallelics)
+- [VEP Step: ensemblvep](#vep-step-ensemblvep)
+- [Exomiser Step: exomiser/results](#exomiser-step-exomiserresults)
+- [Other Steps](#others-steps)
 
-<details markdown="1">
-<summary>Output files</summary>
+## Directory Structure
 
-- `pipeline_info/`
-  - Reports generated by nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.html`. 
-  - Parameters used by the pipeline run: `params.json`.
-  - A copy of the nextflow log file: `nextflow.log`. Note that it will miss logs written after the workflow.onComplete handler is run.
-  - Copies of the configuration files used: `config/*.config`. This includes the default `nextflow.config` file as well as any additional configuration files passed as parameters.
-  - Other metadata relevant for reproducibility: `metadata.txt` . It contains information such as the original command line, the name of the branch and revision used, the username of the person who submitted the job, a list of configuration files passed, the nextflow work directory, etc.
-- `splitmultiallelics/`: pipeline output before running the tools specified via the `tools` parameter.
-- `ensemblvep/`: vep output
-- `exomiser/results`: exomiser output
+The output directory structure is as follows:
 
-You might see other folders named after different pipeline processes. These are considered intermediate pipeline outputs.
+```
+|_ pipeline_info/
+|_ splitmultiallelics/
+|_ ensemblvep/
+|_ exomiser/results/
+...
+```
+
+The `pipeline_info` subdirectory contains details about the pipeline execution and metadata relevant to reproducibility, performance optimization and troubleshooting.
+
+The `splitmultiallelics` subdirectory contains the output of the pipeline after completing the normalization step, just before running the vep or exomiser tools.
+
+The `ensemblvep` subdirectory contains the output after running vep and will appear only if vep is specified in the `tools` parameter.
+
+The `exomiser/results` subdirectory contains the output after running exomiser and will appear only if exomiser is specified in the `tools` parameter.
+
+## Pipeline Information: pipeline_info
+
+Here we describe in more detail the contents of the `pipeline_info` subdirectory. It should contain the following:
+
+```
+|_ pipeline_info
+   |_ configs
+      |_ nextflow.config
+          ... 
+   |_ execution_report_2024-12-09_12-03-20.html
+   |_ execution_timeline_2024-12-09_12-03-20.html
+   |_ execution_trace_2024-12-09_12-03-20.txt
+   |_ params_2024-12-09_12-03-23.json
+   |_ pipeline_dag_2024-12-09_12-03-20.html
+   |_ metadata.txt
+   |_ nextflow.log
+```
+
+  The timestamps that appear in some files are in the user's timezone.
+
+  The `configs` folder contains copies of configuration files used. This includes the default `nextflow.config` file as well as any additional configuration files passed as parameters.
+
+  The files prefixed by `execution_` are reports automatically generated by nextflow. These reports help you troubleshoot errors in the pipeline execution and provide information such as launch commands, run times and resource usage. You can refer to the [nextflow documentation](https://www.nextflow.io/docs/latest/reports.html) for more details about these reports.
+
+  The file prefixed by `params` contains the parameters used by the pipeline.
+
+  The file prefixed by `pipeline_dag` contains a diagram of the pipeline steps.
+
+  The `metadata.txt` file contains information relevant for reproducibility, such as the original command line, the name of the branch / revision used, the username associated with the command, a list of configuration files passed, the nextflow work directory, etc.
+
+  The `nextflow.log` file is a copy of the nextflow log file. Note that it will not include logs written after the `workflow.onComplete` handler is run.
+
+
+## Normalization Step: splitmultiallelics
 
-</details>
+The `splitmultiallelics` subdirectory contains the output of the pipeline after the normalization step, just before running vep and exomiser.
 
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+```
+|_ splitmultiallelics/
+   |_ family1.splitted.vcf.gz
+   |_ family1.splitted.vcf.gz.tbi
+   ... 
+```
+
+It contains one pair of `vcf.gz`, `vcf.gz.tbi` files per family. Specifically, we use the following naming scheme:
+- `<FAMILY_ID>.splitted.vcf.gz`
+- `<FAMILY_ID>.splitted.vcf.gz.tbi`
+
+The family ID should match the family ID in the input sample sheet.
+
+## VEP Step: ensemblvep
+
+The `ensemblvep` subdirectory contains the output of the pipeline after the vep step, if vep was specified in the `tools` parameter.
+
+```
+|_ ensemblvep/
+  |_ variants.family1.vep.vcf.gz
+  |_ variants.family1.vep.vcf.gz.tbi
+  ...
+```
+
+It contains one pair of `vcf.gz`, `vcf.gz.tbi` files per family. Specifically, we use the following naming scheme:
+- `variants.<FAMILY_ID>.vep.vcf.gz`
+- `variants.<FAMILY_ID>.vep.vcf.gz.tbi`
+
+The family ID should match the family ID in the input sample sheet.
+
+## Exomiser Step: exomiser/results
+
+The `exomiser/results` subdirectory contains the output of the pipeline after the exomiser step, if exomiser was specified in the `tools` parameter.
+
+```
+|_ exomiser/results
+   |_ family1.splitted-exomiser.genes.tsv
+   |_ family1.splitted-exomiser.html
+   |_ family1.splitted-exomiser.json
+   |_ family1.splitted-exomiser.variants.tsv
+   |_ family1.splitted-exomiser.vcf.gz
+   |_ family1.splitted-exomiser.vcf.gz.tbi
+   ...
+```
+
+It should contain a set of 6 files per family. Specifically, we use the following naming scheme:
+- `<FAMILY_ID>.splitted-exomiser.genes.tsv`
+- `<FAMILY_ID>.splitted-exomiser.html`
+- `<FAMILY_ID>.splitted-exomiser.json`
+- `<FAMILY_ID>.splitted-exomiser.variants.tsv`
+- `<FAMILY_ID>.splitted-exomiser.vcf.gz`
+- `<FAMILY_ID>.splitted-exomiser.vcf.gz.tbi`
+
+The family ID should match the family ID in the input sample sheet.
+
+For more details about the content of each of these files, see the [exomiser documentation](https://exomiser.readthedocs.io/en/latest/result_interpretation.html).
+
+## Others Steps
+
+You might see other folders named after different pipeline processes. These are considered intermediate pipeline outputs.
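
The naming schemes documented in this diff lend themselves to a quick post-run sanity check. The sketch below is not part of the pipeline or of this pull request. It is a minimal illustration that builds the file paths the documentation says to expect for one family, then reports which of them are missing. The helper names `expected_outputs` and `missing_outputs` are hypothetical, and the sketch assumes the values passed in the `tools` parameter are literally `vep` and `exomiser`.

```python
from pathlib import Path

# Filename templates copied from the naming schemes in docs/output.md.
# "{fid}" stands for the family ID from the input sample sheet.
EXPECTED = {
    "splitmultiallelics": [
        "{fid}.splitted.vcf.gz",
        "{fid}.splitted.vcf.gz.tbi",
    ],
    "ensemblvep": [
        "variants.{fid}.vep.vcf.gz",
        "variants.{fid}.vep.vcf.gz.tbi",
    ],
    "exomiser/results": [
        "{fid}.splitted-exomiser.genes.tsv",
        "{fid}.splitted-exomiser.html",
        "{fid}.splitted-exomiser.json",
        "{fid}.splitted-exomiser.variants.tsv",
        "{fid}.splitted-exomiser.vcf.gz",
        "{fid}.splitted-exomiser.vcf.gz.tbi",
    ],
}


def expected_outputs(outdir, family_id, tools=()):
    """Return the paths the documentation says should exist for one family.

    Hypothetical helper: `tools` is assumed to hold the same identifiers
    as the pipeline's `tools` parameter ("vep", "exomiser").
    """
    steps = ["splitmultiallelics"]  # always produced by the normalization step
    if "vep" in tools:
        steps.append("ensemblvep")
    if "exomiser" in tools:
        steps.append("exomiser/results")
    return [
        Path(outdir) / step / template.format(fid=family_id)
        for step in steps
        for template in EXPECTED[step]
    ]


def missing_outputs(outdir, family_id, tools=()):
    """Return the expected paths that are not present on disk."""
    return [p for p in expected_outputs(outdir, family_id, tools) if not p.exists()]
```

For example, `missing_outputs("results", "family1", tools=("vep",))` would return an empty list only when the two `splitmultiallelics` files and the two `ensemblvep` files for `family1` are all present under `results/`. Intermediate folders named after other pipeline processes are deliberately ignored, since the documentation treats them as intermediate outputs.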