This document describes the parameter options used by the pipeline.
- Running the pipeline
- Inputs and outputs
- Reference genomes
- Adapter trimming
- Read alignment
- Additional parameters
- Software dependencies
- Other command line parameters
The main command for running the pipeline is as follows:
nextflow run epidiverse/wgbs [OPTIONS]
Note that the pipeline will create files in your working directory:
work/ # Directory containing the nextflow working files
wgbs/ # Finished results (configurable, see below)
.nextflow_log # Log file from Nextflow
.nextflow/ # Nextflow cache and history information
Specify the path to the DIRECTORY containing the reads for analysis. All fastq files must be in compressed *{1,2}.fastq.gz
format for PE reads, with separate files for R1 and R2, or in *.fastq.gz
format for SE reads. PE and SE reads cannot be processed together, but any number of read sets of the same type may be included and they will be analysed in parallel. The input directory must contain at least one set of appropriate files for the pipeline to initiate.
Specify the path to a DIRECTORY containing a subset of input reads for higher precision analysis. The pipeline will analyse the --input
reads using erne-bs5 and the --merge
reads using segemehl. The file names in each directory must be identical, and all files must follow the same nomenclature as described for the --input
parameter. Any number of read sets may be included, and they will be analysed in parallel. The individual reads specified in the --merge
files must also be absent from the --input
files. [default: off]
Interpret input reads as SINGLE END instead of PAIRED END. Reads in this format must follow the *.fastq.gz
nomenclature. [default: off]
Specify the path to the output directory. [default: wgbs]
If you specify a custom reference genome for analysis, the pipeline expects a relative directory structure as pictured below. White boxes indicate sub-folders, and orange boxes indicate files. Blue boxes give a description of each file. If any files are missing, then associated processes may raise errors.
Path to the reference genome FILE should be in fasta format, with a corresponding fasta index file *.fai
in the same directory. The appropriate *.ebm
index for erne-bs5 or *.ctidx
and *.gaidx
index files for segemehl are also required. The lambda index is used for estimating bisulfite conversion rate with whichever spike-in is specified with the --split
option, otherwise the standard index is used.
If no index exists for the reference genome, this option can be specified in order to build an index which is appropriate to the alignment strategy outlined in the given pipeline run. [default: off]
Run only the WGBS alignment workflow and exit before methylation calling. This is appropriate if you have multiplexed reads in separate files which require merging prior to downstream analysis. [default: off]
If you have used a custom spike-in to estimate conversion efficiency in place of 'lambda phage', then you must specify the path to the standalone fasta for splitting the alignments. NOTE: there must also be a corresponding fasta index '*.fa.fai' file in the same directory. [default: data/lambda.fa]
Specify the scaffold/contig from which to estimate conversion efficiency following alignment. Can be combined with a spike-in estimation eg. lambda, as the calculation will be made from the given contig without splitting the alignment file. [default: off]
By default the pipeline will assume that pre-trimmed reads have been given to the --input
parameter, and will therefore skip the trimming process unless specified. Trimmimg will be performed with cutadapt which takes into consideration a number of additional parameters:
--trim
Specify to enable trimming. [default: off]
--forward <ARG>
The forward adapter sequence. [default: AGATCGGAAGAGCACACGTCTGAAC]
--reverse <ARG>
The reverse adapter sequence. [default: AGATCGGAAGAGCGTCGTGTAGGGA]
--clip5 <ARG>
Number of bases for hard clipping at the 5'-end. [default: 0]
--clip3 <ARG>
Number of bases for hard clipping at the 3'-end. [default: 0]
--minQual <ARG>
Minimum base quality threshold for trimming. [default: 20]
--minLeng <ARG>
Minimum length threshold of trimmed sequences. [default: 36]
--minOver <ARG>
Minimum overlap length for adapter sequences in reads. [default: 3]
Generate fastqc report for sequencing reads. Report will be generated for trimmed reads if the adapter trimming process has been enabled. [default: off]
Bisulfite read alignment by default is carried out by erne-bs5, an efficient aligner based on a 5-letter sequence alphabet and an index built on a Burrows-Wheeler Transformation (BWT). Several additional parameters can also be provided to modify the mapping results:
--minIns <ARG>
The minimum insert size between PE reads. [default: 0]
--maxIns <ARG>
The maximum insert size between PE reads. [default: 500]
--maxErrors <ARG>
Specify the maximum number of base errors allowed during read alignment with 'erne-bs5'. This will disable the default auto-error function. [default: off]
--XF <ARG>
Specify a proportion of opposite-strand bisulfite mismatches to allow before filtering. This excludes reads in regions that appear to have undergone genome rearrangement to the opposite strand, as alignment in this case is unreliable. [default: 0.03]
Enable bisulfite read alignment using only segemehl. This aligner has higher precision but is more memory and time intensive than 'erne-bs5'. It is suitable for smaller analyses or for redoing analyses on "interesting" samples. [default: off]
--minAccuracy <ARG>
Specify the minimum percentage threshold for base matches during read alignment with 'segemehl'. [default: 90]
Specify that only uniquely-mapping reads should be used in downstream methylation calling. [default: off]
Run only the methylation calling workflow. This changes the behaviour of the pipeline to expect existing bam files in the --input
directory, such as those generated from a prior run of the WGBS alignment workflow. This is appropriate if you want to do manual processing of bam files (eg. merging RG tags according to sample) or eg. re-doing methylation calling with/without MarkDuplicates [default: off]
Do not call methylated positions in CpG context. Note: at least one methylation context is required for analysis. [default: off]
Do not call methylated positions in CHG context. Note: at least one methylation context is required for analysis. [default: off]
Do not call methylated positions in CHH context. Note: at least one methylation context is required for analysis. [default: off]
If your protocol does not include a spike-in from which to estimate conversion efficiency, then specify this parameter to disable the automatic alignment to eg. lambda genome for this purpose. [default: off]
Skip the deduplication step with Picard MarkDuplicates if for example your protocol does not include PCR. [default: off]
Specify that any trimmed reads generated by the pipeline should be kept after analysis. [default: off]
Specify that any intermediate bam files generated by the pipeline should be kept after analysis, rather than only the final bam file. [default: off]
Specify in order to prevent Nextflow from clearing the work dir cache following a successful pipeline completion. [default: off]
When called with nextflow run epidiverse/wgbs --version
this will display the pipeline version and quit.
When called with nextflow run epidiverse/wgbs --help
this will display the parameter options and quit.
There are different ways to provide the required software dependencies for the pipeline. The recommended method is to use the Conda, Docker or Singularity profiles as provided by the pipeline.
Use this parameter to choose a preset configuration profile. See the installation documentation for more information about profiles.
Profiles available with the pipeline are:
standard
- The default profile, used if
-profile
is not specified. - Uses sensible resource allocation for , runs using the
local
executor (native system calls) and expects all software to be installed and available on the$PATH
. - This profile is mainly designed to be used as a starting point for other configurations and is inherited by most of the other profiles below.
- The default profile, used if
conda
- Builds a conda environment from the environment.yaml file provided by the pipeline
- Requires conda to be installed on your system.
docker
- Launches a docker image pulled from epidiverse/wgbs
- Requires docker to be installed on your system.
singularity
- Launches a singularity image pulled from epidiverse/wgbs
- Requires singularity to be installed on your system.
epi|diverse
- Designed to be used on the EpiDiverse clusters
epi
ordiverse
- Launches jobs using the
SLURM
executor. - Uses pre-built conda environments to provide all software requirements.
- Comes with built in --thlaspi, --fragaria and --populus shortcuts for reference paths.
- Designed to be used on the EpiDiverse clusters
custom
- No configuration at all. Useful if you want to build your own config from scratch and want to avoid loading in the default
base
config for process resource allocation.
- No configuration at all. Useful if you want to build your own config from scratch and want to avoid loading in the default
If you wish to provide your own package containers it is possible to do so by setting the standard
or custom
profile, and then providing your custom package with the command line flags below. These are not required with the the other profiles.
Flag to enable conda. You can provide either a pre-built environment or a *.yaml file.
Flag to enable docker. The image will automatically be pulled from Dockerhub.
Flag to enable use of singularity. The image will automatically be pulled from the internet. If running offline, follow the option with the full path to the image file.
Specify the path to a custom work directory for the pipeline to run with (eg. on a scratch directory)
Provide a file with specified parameters to avoid typing them out on the command line. This is useful for carrying out repeated analyses. A template params file assets/params.txt
has been made available in the pipeline repository.
Provide a custom config file for adapting the pipeline to run on your own computing infrastructure. A template config file assets/custom.config
has been made available in the pipeline repository. This file can be used as a boilerplate for building your own custom config.
Specify this when restarting a pipeline. Nextflow will used cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. Give a specific pipeline name as an argument to resume it, otherwise Nextflow will resume the most recent. NOTE: This will not work if the specified run finished successfully and the cache was automatically cleared. (see: --debug
)
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.