Skip to content

tallnuttrbgv/HybPiper-RBGV

 
 

Repository files navigation

HybPiper-RBGV

Original HybPiper github, wiki and tutorial:

For an explanation of the general purpose of HybPiper, and the approach it takes to generate target locus sequences from sequence capture data, please see the excellent documentation, wiki and tutorial at the mossmatters repo:

HybPiper-RBGV: containerised and pipelined using Singularity and Nextflow

To simplify running HybPiper, I’ve provided a Singularity container based on the Linux distribution Ubuntu 20.04, with all the scripts required to run the HybPiper pipeline (including modifications, additions and bug fixes, see below for details), as well as all the dependencies (BioPython, BLAST, BWA, BBmap, Exonerate, SPAdes, Samtools). The container is called hybpiper-yang-and-smith-rbgv.sif.

To run HybPiper using this container, I’ve provided a Nextflow pipeline that uses the software in the Singularity container. This pipeline runs all HybPiper steps with a single command. The pipeline script is called hybpiper-rbgv-pipeline.nf. It comes with an associated config file called hybpiper-rbgv.config. The only input required is a folder of sequencing reads for your samples, and a target file in .fasta format. The Nextflow pipeline will automatically generate the namelist.txt file required by some of the HybPiper scripts, and will run all HybPiper scripts on each sample in parallel. It also includes an optional read-trimmming step to QC your reads prior to running HybPiper, using the software Trimmomatic. The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).

Input data

Target file

Your target file, which can contain either nucleotide OR protein sequences (see below for corresponding pipeline options), should follow the formatting described for HybPiper here. Briefly, your target genes should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:

>AJFN-4471
AATGTTATACAGGATGAAGAGAAACTGAATACTGCAAACTCCGATTGGATGCGGAAATACAAAGGCT...
>Ambtr-4471
AGTGTTATTCAAGATGAAGATGTATTGTCGACAGCCAATGTGGATTGGATGCGGAAATATAAGGGCA...
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGGGGGTAGAGAACATGCTCTATGCTATGCTTTGC...
>Arath-4691
GAGCTTGGATCTATCGCTTGCGCAGCTCTCTGTGCTTGCACTCTTACAATAGCTTCTCCTGTTATTG...
>BHYC-4691
GAAGTGAACTGTGTTGCTTGTGGGTTTCTTGCTGCTCTTGCTGTCACTGCTTCTCCCGTAATCGCTG...
etc...

Read files

You will need to provide the hybpiper-rbgv-pipeline.nf pipeline with either:

a) A directory of paired-end forwards and reverse reads (and, optionally, a file of single-end reads) for each sample; or

b) A directory of single-end reads.

For the read files to be recognised by the pipeline, they should be named according to the default convention:

*_R1.fastq 
*_R2.fastq
*_single.fastq (optional; will be used if running with the flag `--paired_and_single`)

OR

*_R1.fq 
*_R2.fq
*_single.fq (optional; will be used if running with the flag `--paired_and_single`)

NOTE:

  • It’s fine if there’s text after the R1/R1 and before the .fastq/.fq, as long as it's the same for both read files.
  • It’s fine if the input files are gzipped (i.e. suffix .gz).
  • You can provide a folder of single-end reads only (use the flag --single_only if you do). Your read files should be named *_single.fastq (or .fastq.gz, .fq, .fq.gz).
  • You can specify a custom pattern used for read file matching via the parameters --read_pairs_pattern <pattern> or --single_pattern <pattern>.
  • If your samples have been run across multiple lanes, you'll likely want to combine read files for each sample before processing. The pipeline can do this for you - see here for details.

Running on Linux

Please see the Wiki entry Running on Linux.

Running on a Mac (macOS)

Please see the Wiki entry Running on a Mac.

NOTE: Macs using the new Apple M1 chip are not yet supported.

Running on a PC (Windows)

Please see the Wiki entry Running on a PC.

Nextflow pipeline options and parameters

Example run command:

nextflow run hybpiper-rbgv-pipeline.nf -c hybpiper-rbgv.config --illumina_reads_directory reads_for_hybpiper --target_file Angiosperms353_targetSequences.fasta

Options:

Mandatory arguments:

#######################################################################################################################

--illumina_reads_directory <directory>    Path to folder containing illumina read file(s)

--target_file <file>                      File containing fasta sequences of target genes

#######################################################################################################################

Optional arguments:

 -profile <profile>                       Configuration profile to use. Can use multiple (comma separated)
                                          Available: standard (default), slurm
 
 -resume                                  If restarting the pipeline, use cached results from previously completed 
                                          processes

 --cleanup                                Run 'cleanup.py' for each gene directory after 'reads_first.py'

 --nosupercontigs                         Do not create supercontigs. Use longest Exonerate hit only. Default is off.
 
 --bbmap_subfilter <int>                  Ban alignments with more than this many substitutions when performing 
                                          read-pair mapping to supercontig reference (bbmap.sh). Default is 7

 --memory <int>                           Memory (RAM) amount in GB to use for bbmap.sh with exonerate_hits.py. 
                                          Default is 1 GB

 --discordant_reads_edit_distance <int>   Minimum number of base differences between one read of a read pair 
                                          vs the supercontig reference for a read pair to be flagged as discordant. 
                                          Default is 5

 --discordant_reads_cutoff <int>          Minimum number of discordant reads pairs required to flag a supercontigs 
                                          as a potential hybrid of contigs from multiple paralogs. Default is 5

 --merged                                 Merge forward and reverse reads, and run SPAdes assembly with merged and 
                                          unmerged (the latter in interleaved format) data. Default is off

 --paired_and_single                      Use when providing both paired R1 and R2 read files as well as a file of 
                                          single-end reads for each sample

 --single_only                            Use when providing providing only a folder of single-end reads

 --outdir <directory_name>                Specify the name of the pipeline results directory. Default is 'results'

 --read_pairs_pattern <pattern>           Provide a comma-separated read pair pattern for matching forwards and 
                                          reverse paired-end read-files. Default is 'R1,R2'

 --single_pattern <pattern>               Provide a pattern for matching single-end read files. Default is 'single'

 --use_blastx                             Use a protein target file and map reads to targets with BLASTx. Default 
                                          is a nucleotide target file and mapping of reads to targets using BWA

 --num_forks <int>                        Specify the number of parallel processes (e.g. concurrent runs of 
                                          reads.first.py) to run at any one time. Can be used to prevent Nextflow 
                                          from using all the threads/cpus on your machine. Default is to use the 
                                          maximum number possible

 --cov_cutoff <int>                       Coverage cutoff to pass to the SPAdes assembler. Default is 8

 --blastx_evalue <value>                  Evalue to pass to blastx when using blastx mapping (i.e. when the 
                                          --use_blastx flag is specified). Default is 1e-4

 --paralog_warning_min_len_percent <decimal>
                                          Minimum length percentage of a contig vs reference protein length for 
                                          a paralog warning to be generated and a putative paralog contig to be 
                                          recovered. Default is 0.75

 --translate_target_file_for_blastx       Translate a nucleotide target file. If set, the --use_blastx is set 
                                          by default. Default is off

 --use_trimmomatic                        Trim forwards and reverse reads using Trimmomatic. Default is off

 --trimmomatic_leading_quality <int>      Cut bases off the start of a read, if below this threshold quality. 
                                          Default is 3

 --trimmomatic_trailing_quality <int>     Cut bases off the end of a read, if below this threshold quality. 
                                          Default is 3

 --trimmomatic_min_length <int>           Drop a read if it is below this specified length. Default is 36

 --trimmomatic_sliding_window_size <int>  Size of the sliding window used by Trimmomatic; specifies the number of 
                                          bases to average across. Default is 4

 --trimmomatic_sliding_window_quality <int>
                                          Specifies the average quality required within the sliding window. 
                                          Default is 20

 --run_intronerate                        Run intronerate.py to recover (hopefully) intron and supercontig sequences.
                                          Default is off, and so results `subfolders 09_sequences_intron` and 
                                          `10_sequences_supercontig` will be empty
                                      
 --combine_read_files                     Group and concatenate read-files via a common prefix. Useful if samples 
                                          have been run across multiple lanes. Default prefix is all text preceding 
                                          the first underscore (_) in read filenames
                                          
 --combine_read_files_num_fields <int>    Number of fields (delimited by an underscore) to use for combining read 
                                          files when using the `--combine_read_files` flag. Default is 1

Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.

Output folders and files

If you're just after the unaligned .fasta files for each of your target genes (not including putative paralogs), the two main output folders of interest are probably:

  • 07_sequences_dna
  • 08_sequences_aa

If you need the per-sample output folders produced by a standard HybPiper run, these can be found in folder:

  • 04_processed_gene_directories

For a full explanation of output folders and files, please see the Wiki entry Output folders and files.

General information

For details on adapting the pipeline to run on local and HPC computing resources, see here.

Bug fixes and changes (WIP)

Please see the Wiki entry Bug fixes and changes.

Issues still to deal with (WIP)

Please see the Wiki entry Issues.

Changelog

09 November 2021

  • Nextflow script updated to DSL2. NOTE: Nextflow version >= 21.04.1 is now required!
  • Singularity *.def file updated to use Ubuntu 20.04, and to install Muscle and FastTreeMP.

About

Nextflow/Singularity pipeline for running modified scripts from HybPiper (https://github.com/mossmatters/HybPiper)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Nextflow 100.0%