Parameters documentation is available in the pipeline schema. You can use the command `nf-core schema docs` to output parameters documentation. To avoid duplicating information, we keep parameter details in markdown files to a minimum. Currently, we only add context for the reference data parameters and provide parameter summaries for convenience.
The Ferlab-Ste-Justine/Post-processing-Pipeline is a bioinformatics pipeline designed for family-based analysis of GVCFs from multiple samples. It performs joint genotyping, tags low-quality variants, and optionally annotates the final VCF using VEP and/or Exomiser. This document provides instructions on how to prepare input files, run the pipeline, and understand the output.
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `--input` parameter to specify its location. The samplesheet has to be a comma-separated file (.csv).
The samplesheet must contain at least the following columns:
- familyId: The identifier used for the sample family
- sample: The identifier used for the sample
- sequencingType: Must be either WES (Whole Exome Sequencing) or WGS (Whole Genome Sequencing)
- gvcf: Path to the sample .gvcf.gz file
Additionally, there is an optional phenoFamily column that can contain a .yml/.json file providing phenotype information on the family in phenopacket format. This column is only necessary if using the exomiser tool.
sample.csv
familyId,sample,sequencingType,gvcf,phenoFamily
CONGE-XXX,01,WES,CONGE-XXX-01.hard-filtered.gvcf.gz,CONGE-XXX.pheno.yml
CONGE-XXX,02,WES,CONGE-XXX-02.hard-filtered.gvcf.gz,CONGE-XXX.pheno.yml
CONGE-XXX,03,WES,CONGE-XXX-03.hard-filtered.gvcf.gz,CONGE-XXX.pheno.yml
CONGE-YYY,01,WGS,CONGE-YYY-01.hard-filtered.gvcf.gz,CONGE-YYY.pheno.yml
CONGE-YYY,02,WGS,CONGE-YYY-02.hard-filtered.gvcf.gz,CONGE-YYY.pheno.yml
CONGE-YYY,03,WGS,CONGE-YYY-03.hard-filtered.gvcf.gz,CONGE-YYY.pheno.yml
Note
The sequencing type (WES or WGS) will determine the variant filtering approach used by the pipeline. In the case of Whole Genome Sequencing, VQSR (Variant Quality Score Recalibration) is used. In the case of Whole Exome Sequencing, VQSR is replaced by a hard filtering approach as VQSR cannot be applied in this case. Additionally, a different analysis file will be used when running the exomiser tool based on the sequencing type.
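The column requirements above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the `check_samplesheet` function name and the example file contents are assumptions made for illustration, and the check does not handle quoted fields that contain commas.

```shell
# Write a small example samplesheet (file names are placeholders).
cat > samplesheet.csv <<'EOF'
familyId,sample,sequencingType,gvcf,phenoFamily
CONGE-XXX,01,WES,CONGE-XXX-01.hard-filtered.gvcf.gz,CONGE-XXX.pheno.yml
CONGE-YYY,01,WGS,CONGE-YYY-01.hard-filtered.gvcf.gz,CONGE-YYY.pheno.yml
EOF

# Hypothetical helper: verify that the required columns exist in the header
# and that sequencingType is WES or WGS on every data row.
check_samplesheet() {
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("familyId sample sequencingType gvcf", required, " ")
      for (j = 1; j <= n; j++) {
        if (!(required[j] in col)) {
          print "missing column: " required[j]
          bad = 1
        }
      }
      st = col["sequencingType"]
      next
    }
    st && $st != "WES" && $st != "WGS" {
      print "line " NR ": invalid sequencingType: " $st
      bad = 1
    }
    END {
      if (!bad) print "samplesheet OK"
      exit bad + 0
    }
  ' "$1"
}

check_samplesheet samplesheet.csv
```

The function returns a non-zero exit status when a problem is found, so it can be chained in front of the run command with `&&`.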
Reference files are essential at various stages of the workflow, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser.
These files must be correctly downloaded and specified through pipeline parameters. For more details about how to do this, see reference_data.md.
The typical command for running the pipeline is as follows:
nextflow run -c cluster.config Ferlab-Ste-Justine/Post-processing-Pipeline -r "v2.4.1" \
-params-file params.json \
--input samplesheet.csv \
--outdir results/dir \
--tools vep,exomiser
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
<OUTDIR> # Finished results in specified location (defined with --outdir)
.nextflow.log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file (json or yaml).
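As an illustration, a params file could be created as sketched below. The parameter names come from the parameter summary in this document; every path and file name is a placeholder to adapt, and the reference-data parameters you actually need depend on the tools you enable (see reference_data.md).

```shell
# Write a minimal params file; all values below are placeholders.
cat > params.json <<'EOF'
{
  "input": "samplesheet.csv",
  "outdir": "results/dir",
  "referenceGenome": "/path/to/reference/genome/dir",
  "referenceGenomeFasta": "genome.fasta"
}
EOF

# The file is then passed to the run command with: -params-file params.json
```

Flags given on the command line can still be combined with the params file, as in the typical command shown earlier.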
:::warning
Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for tuning process resource specifications, other infrastructural tweaks (such as output directories), or module arguments (args).
:::
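By contrast, a custom config file passed with `-c` would typically look like the sketch below. The file name `cluster.config` matches the typical command above; the resource values are illustrative assumptions, not recommendations.

```shell
# Write a custom config that only tunes process resources.
cat > cluster.config <<'EOF'
// Resource tuning only; pipeline parameters belong in the params file.
process {
    withName: 'ENSEMBLVEP_VEP' {
        cpus   = 4          // illustrative value
        memory = '16 GB'    // illustrative value
    }
}
EOF
```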
At the beginning of our workflow, we separate MNPs into individual SNPs.
You can optionally skip this step by setting the `exclude_mnps` parameter to `false` (default is `true`).
Note that MNPs are not supported by the VQSR procedure, so you cannot skip this step if you have whole genome data.
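A simple pre-flight check can guard against skipping this step on whole genome data. The sketch below is only an illustration: the file name, the variable names, and the assumption that sequencingType is the third column (as in the example samplesheet) are not part of the pipeline.

```shell
# Example samplesheet with one WES and one WGS sample (placeholder file names).
cat > samplesheet_mnps.csv <<'EOF'
familyId,sample,sequencingType,gvcf
FAM1,01,WES,FAM1-01.gvcf.gz
FAM2,01,WGS,FAM2-01.gvcf.gz
EOF

# Count data rows whose third column (sequencingType) is WGS.
wgs_rows=$(awk -F',' 'NR > 1 && $3 == "WGS" { n++ } END { print n + 0 }' samplesheet_mnps.csv)

# Value the user would like to pass (hypothetical variable).
requested_exclude_mnps=false

# Skipping the MNP split is only allowed when no WGS sample is present.
if [ "$requested_exclude_mnps" = "false" ] && [ "$wgs_rows" -gt 0 ]; then
  echo "WGS samples found: keeping exclude_mnps=true" >&2
  exclude_mnps=true
else
  exclude_mnps=$requested_exclude_mnps
fi

echo "--exclude_mnps $exclude_mnps"
```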
You can include additional analysis in your pipeline via the `tools` parameter. Currently, the pipeline supports two tools: `vep` (Variant Effect Predictor) and `exomiser`.
VEP is a widely used tool for annotating genetic variants with information such as gene names, variant consequences, and population frequencies. It provides valuable insights into the functional impact of genetic variants.
Exomiser, on the other hand, is a tool specifically designed for the analysis of rare genetic diseases. It integrates phenotype data with variant information to prioritize variants that are likely to be disease-causing. This can greatly assist in the identification of potential disease-causing variants in exome sequencing data.
By default, both vep and exomiser steps, if applicable, run in parallel and consume the output of the normalization step.
To have the Exomiser step start from the VEP output instead, set the parameter `exomiser_start_from_vep` to `true`. In this case, the vep and exomiser steps will run sequentially.
Note that the parameter `exomiser_start_from_vep` will be ignored if vep is not specified via the `tools` parameter.
We typically allow passing extra arguments in our process scripts via the process `task.ext` directive (`task.ext.args` key).
When using the exomiser process, it's important to distinguish between regular CLI options and options that correspond to properties normally specified in the application.properties file.
Regular CLI options should be added to `task.ext.args`.
Options that correspond to application properties (typically of the form `--exomiser.some-property=value`) must be added to `task.ext.application_properties_args`. These options need to be grouped at the end of the exomiser command to ensure that regular exomiser CLI options are parsed correctly.
If needed, it is possible to customize the options passed to the vep command by overriding the ext.args directive for the ENSEMBLVEP_VEP process. See conf/modules.config.
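The overrides described above might be written as in the following sketch. `ENSEMBLVEP_VEP` is the process name given in this document, but `EXOMISER` is an assumed process name and `custom_args.config` an assumed file name; check conf/modules.config for the real selectors and default arguments. Note that setting `ext.args` in a custom config replaces the default value rather than appending to it.

```shell
# Write a custom config that overrides module arguments.
cat > custom_args.config <<'EOF'
process {
    // Process name taken from this document; the VEP flag is only an example.
    withName: 'ENSEMBLVEP_VEP' {
        ext.args = '--everything'
    }
    // 'EXOMISER' is an assumed process name; verify it in conf/modules.config.
    // Regular exomiser CLI options would go in ext.args instead.
    withName: 'EXOMISER' {
        ext.application_properties_args = '--exomiser.some-property=value'
    }
}
EOF
```

The file is then supplied to the run command with `-c custom_args.config`.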
The `-stub` (or `-stub-run`) option can be added to run the "stub" block of processes instead of the "script" block. This can be helpful for testing.
To test your setup in stub mode, simply run `nextflow run Ferlab-Ste-Justine/Post-processing-Pipeline -profile test,docker -stub`.
For tests with real data, see the documentation in the test configuration profile.
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. Subsequent runs always use the cached version if available, even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version:
nextflow pull Ferlab-Ste-Justine/Post-processing-Pipeline
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the ferlab/postprocessing releases page and find the latest pipeline version - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. Of course, you can switch to another version by changing the number after the `-r` flag.
This version number will be logged in reports when you run the pipeline (for example, at the bottom of the MultiQC reports), so that you'll know what you used when you look back in the future.
To further assist in reproducibility, you can share and re-use parameter files to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
:::tip
If you wish to share such a file (for example, as supplementary material for an academic publication), make sure NOT to include cluster-specific paths to files or institution-specific profiles.
:::
- Use the `-profile` parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments (e.g., docker, singularity, conda). Multiple profiles can be loaded in sequence, e.g., `-profile test,docker`.
- Use the `-resume` parameter to restart a pipeline from where it left off. This can save time by using cached results from previous runs.
- You can specify a custom configuration file using the `-c` parameter. This is useful to set configuration specific to your execution environment and change requested resources for a process.
For more detailed information, please refer to the official Nextflow documentation.
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
To limit this, you can use the `NXF_OPTS` environment variable:
NXF_OPTS='-Xms1g -Xmx4g'
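Putting these points together, the Nextflow head process can itself be wrapped in a job script with the memory cap exported. The sketch below assumes a Slurm scheduler; the script name and the `#SBATCH` resource values are illustrative assumptions, and `cluster.config` and `params.json` are the files used in the typical command above.

```shell
# Write a job script for the Nextflow head process (Slurm is an assumption).
cat > run_pipeline.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=postprocessing
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=6G

# Cap the memory requested by the Nextflow JVM itself.
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run -c cluster.config Ferlab-Ste-Justine/Post-processing-Pipeline -r "v2.4.1" \
    -params-file params.json \
    -resume
EOF
chmod +x run_pipeline.sh

# Submit with: sbatch run_pipeline.sh
```

The memory requested for the job is kept above the `-Xmx` value so that the scheduler does not kill the head process when the JVM reaches its cap.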
| Parameter name | Required? | Description |
|---|---|---|
| `input` | Required | Path to the input file |
| `outdir` | Required | Path to the output directory |
| `referenceGenome` | Required | Path to the directory containing the reference genome data |
| `referenceGenomeFasta` | Required | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
| `dbsnpFile` | Optional | Path to dbsnp file. If specified, will be used to add ids in the ID column of output vcf files. |
| `dbsnpFileIndex` | Optional | Path to dbsnp file index. Must be specified if the `dbsnpFile` parameter is specified. |
| `broad` | Optional | Path to the directory containing Broad reference data (for VQSR) |
| `intervalsFile` | Optional | Path to the file containing the genome intervals list on which to operate |
| `tools` | Optional | Additional tools to run, separated by commas. Supported tools are `vep` and `exomiser` |
| `vepCache` | Optional | Path to the vep cache data directory |
| `exclude_mnps` | Optional | Replace MNPs by individual SNPs (default: `true`). Must be `true` on whole genome data. |
| `exomiser_data_dir` | Optional | Path to the exomiser reference data directory |
| `exomiser_genome` | Optional | Genome assembly version to be used by exomiser (`hg19` or `hg38`) |
| `exomiser_data_version` | Optional | Exomiser data version (e.g., `2402`) |
| `exomiser_cadd_version` | Optional | Version of the CADD data to be used by exomiser (e.g., `1.7`) |
| `exomiser_cadd_indel_filename` | Optional | Filename of the exomiser CADD indel data file (e.g., `gnomad.genomes.r4.0.indel.tsv.gz`) |
| `exomiser_cadd_snv_filename` | Optional | Filename of the exomiser CADD snv data file (e.g., `whole_genome_SNVs.tsv.gz`) |
| `exomiser_remm_version` | Optional | Version of the REMM data to be used by exomiser (e.g., `0.3.1.post1`) |
| `exomiser_remm_filename` | Optional | Filename of the exomiser REMM data file (e.g., `ReMM.v0.3.1.post1.hg38.tsv.gz`) |
| `exomiser_analysis_wes` | Optional | Path to the exomiser analysis file for WES data, if different from the default |
| `exomiser_analysis_wgs` | Optional | Path to the exomiser analysis file for WGS data, if different from the default |
| `exomiser_start_from_vep` | Optional | If `true` (default `false`), run the exomiser analysis on the VEP annotated VCF file. Ignored if vep is not activated via the `tools` parameter. |