For an explanation of the general purpose of HybPiper, and the approach it takes to generate target locus sequences from sequence capture data, please see the excellent documentation, wiki and tutorial at the mossmatters
repo:
- https://github.com/mossmatters/HybPiper/
- https://github.com/mossmatters/HybPiper/wiki
- https://github.com/mossmatters/HybPiper/wiki/Tutorial
To simplify running HybPiper, I’ve provided a Singularity container based on the Linux distribution Ubuntu 20.04, with all the scripts required to run the HybPiper pipeline (including modifications, additions and bug fixes, see below for details), as well as all the dependencies (BioPython, BLAST, BWA, BBmap, Exonerate, SPAdes, Samtools). The container is called hybpiper-yang-and-smith-rbgv.sif
.
To run HybPiper using this container, I’ve provided a Nextflow pipeline that uses the software in the Singularity container. This pipeline runs all HybPiper steps with a single command. The pipeline script is called hybpiper-rbgv-pipeline.nf
. It comes with an associated config file called hybpiper-rbgv.config
. The only input required is a folder of sequencing reads for your samples, and a target file in .fasta
format. The Nextflow pipeline will automatically generate the namelist.txt
file required by some of the HybPiper scripts, and will run all HybPiper scripts on each sample in parallel. It also includes an optional read-trimmming step to QC your reads prior to running HybPiper, using the software Trimmomatic. The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).
Your target file
, which can contain either nucleotide OR protein sequences (see below for corresponding pipeline options), should follow the formatting described for HybPiper here. Briefly, your target genes should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:
>AJFN-4471
AATGTTATACAGGATGAAGAGAAACTGAATACTGCAAACTCCGATTGGATGCGGAAATACAAAGGCT...
>Ambtr-4471
AGTGTTATTCAAGATGAAGATGTATTGTCGACAGCCAATGTGGATTGGATGCGGAAATATAAGGGCA...
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGGGGGTAGAGAACATGCTCTATGCTATGCTTTGC...
>Arath-4691
GAGCTTGGATCTATCGCTTGCGCAGCTCTCTGTGCTTGCACTCTTACAATAGCTTCTCCTGTTATTG...
>BHYC-4691
GAAGTGAACTGTGTTGCTTGTGGGTTTCTTGCTGCTCTTGCTGTCACTGCTTCTCCCGTAATCGCTG...
etc...
You will need to provide the hybpiper-rbgv-pipeline.nf
pipeline with either:
a) A directory of paired-end forwards and reverse reads (and, optionally, a file of single-end reads) for each sample; or
b) A directory of single-end reads.
For the read files to be recognised by the pipeline, they should be named according to the default convention:
*_R1.fastq
*_R2.fastq
*_single.fastq (optional; will be used if running with the flag `--paired_and_single`)
OR
*_R1.fq
*_R2.fq
*_single.fq (optional; will be used if running with the flag `--paired_and_single`)
NOTE:
- It’s fine if there’s text after the
R1
/R1
and before the.fastq
/.fq
, as long as it's the same for both read files. - It’s fine if the input files are gzipped (i.e. suffix
.gz
). - You can provide a folder of single-end reads only (use the flag
--single_only
if you do). Your read files should be named*_single.fastq
(or.fastq.gz
,.fq
,.fq.gz
). - You can specify a custom pattern used for read file matching via the parameters
--read_pairs_pattern <pattern>
or--single_pattern <pattern>
. - If your samples have been run across multiple lanes, you'll likely want to combine read files for each sample before processing. The pipeline can do this for you - see here for details.
Please see the Wiki entry Running on Linux.
Please see the Wiki entry Running on a Mac.
NOTE: Macs using the new Apple M1 chip are not yet supported.
Please see the Wiki entry Running on a PC.
Example run command:
nextflow run hybpiper-rbgv-pipeline.nf -c hybpiper-rbgv.config --illumina_reads_directory reads_for_hybpiper --target_file Angiosperms353_targetSequences.fasta
Options:
Mandatory arguments:
#######################################################################################################################
--illumina_reads_directory <directory> Path to folder containing illumina read file(s)
--target_file <file> File containing fasta sequences of target genes
#######################################################################################################################
Optional arguments:
-profile <profile> Configuration profile to use. Can use multiple (comma separated)
Available: standard (default), slurm
-resume If restarting the pipeline, use cached results from previously completed
processes
--cleanup Run 'cleanup.py' for each gene directory after 'reads_first.py'
--nosupercontigs Do not create supercontigs. Use longest Exonerate hit only. Default is off.
--bbmap_subfilter <int> Ban alignments with more than this many substitutions when performing
read-pair mapping to supercontig reference (bbmap.sh). Default is 7
--memory <int> Memory (RAM) amount in GB to use for bbmap.sh with exonerate_hits.py.
Default is 1 GB
--discordant_reads_edit_distance <int> Minimum number of base differences between one read of a read pair
vs the supercontig reference for a read pair to be flagged as discordant.
Default is 5
--discordant_reads_cutoff <int> Minimum number of discordant reads pairs required to flag a supercontigs
as a potential hybrid of contigs from multiple paralogs. Default is 5
--merged Merge forward and reverse reads, and run SPAdes assembly with merged and
unmerged (the latter in interleaved format) data. Default is off
--paired_and_single Use when providing both paired R1 and R2 read files as well as a file of
single-end reads for each sample
--single_only Use when providing providing only a folder of single-end reads
--outdir <directory_name> Specify the name of the pipeline results directory. Default is 'results'
--read_pairs_pattern <pattern> Provide a comma-separated read pair pattern for matching forwards and
reverse paired-end read-files. Default is 'R1,R2'
--single_pattern <pattern> Provide a pattern for matching single-end read files. Default is 'single'
--use_blastx Use a protein target file and map reads to targets with BLASTx. Default
is a nucleotide target file and mapping of reads to targets using BWA
--num_forks <int> Specify the number of parallel processes (e.g. concurrent runs of
reads.first.py) to run at any one time. Can be used to prevent Nextflow
from using all the threads/cpus on your machine. Default is to use the
maximum number possible
--cov_cutoff <int> Coverage cutoff to pass to the SPAdes assembler. Default is 8
--blastx_evalue <value> Evalue to pass to blastx when using blastx mapping (i.e. when the
--use_blastx flag is specified). Default is 1e-4
--paralog_warning_min_len_percent <decimal>
Minimum length percentage of a contig vs reference protein length for
a paralog warning to be generated and a putative paralog contig to be
recovered. Default is 0.75
--translate_target_file_for_blastx Translate a nucleotide target file. If set, the --use_blastx is set
by default. Default is off
--use_trimmomatic Trim forwards and reverse reads using Trimmomatic. Default is off
--trimmomatic_leading_quality <int> Cut bases off the start of a read, if below this threshold quality.
Default is 3
--trimmomatic_trailing_quality <int> Cut bases off the end of a read, if below this threshold quality.
Default is 3
--trimmomatic_min_length <int> Drop a read if it is below this specified length. Default is 36
--trimmomatic_sliding_window_size <int> Size of the sliding window used by Trimmomatic; specifies the number of
bases to average across. Default is 4
--trimmomatic_sliding_window_quality <int>
Specifies the average quality required within the sliding window.
Default is 20
--run_intronerate Run intronerate.py to recover (hopefully) intron and supercontig sequences.
Default is off, and so results `subfolders 09_sequences_intron` and
`10_sequences_supercontig` will be empty
--combine_read_files Group and concatenate read-files via a common prefix. Useful if samples
have been run across multiple lanes. Default prefix is all text preceding
the first underscore (_) in read filenames
--combine_read_files_num_fields <int> Number of fields (delimited by an underscore) to use for combining read
files when using the `--combine_read_files` flag. Default is 1
Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.
If you're just after the unaligned .fasta
files for each of your target genes (not including putative paralogs), the two main output folders of interest are probably:
07_sequences_dna
08_sequences_aa
If you need the per-sample output folders produced by a standard HybPiper run, these can be found in folder:
04_processed_gene_directories
For a full explanation of output folders and files, please see the Wiki entry Output folders and files.
For details on adapting the pipeline to run on local and HPC computing resources, see here.
Please see the Wiki entry Bug fixes and changes.
Please see the Wiki entry Issues.
09 November 2021
- Nextflow script updated to DSL2. NOTE: Nextflow version >= 21.04.1 is now required!
- Singularity *.def file updated to use Ubuntu 20.04, and to install Muscle and FastTreeMP.