v1.0.0
PHBG v1.0.0 Release Notes
This major release introduces a stable and validated version of the TheiaProk workflow series.
This release also offers two new workflows (TheiaProk_Illumina_SE and RASUSA) and multiple organism-agnostic modules described in more detail below.
About TheiaProk
The TheiaProk workflows are for assembly and characterization of prokaryotic genomes, principally bacteria. All input reads go through steps in the core workflow for read trimming and assembly, quality assessment, species identification, and resistance gene identification. Sub-workflows further characterize some genomes, with activation of these processes dependent on the taxa identified.
Currently, TheiaProk has two forms: one for Illumina paired-end sequencing data (TheiaProk_Illumina_PE) and one for Illumina single-end sequencing data (TheiaProk_Illumina_SE). Future plans include development of workflows for alternative sequence data types, like Oxford Nanopore.
The following information describes the changes since the v0.6.0 version.
New modules to the TheiaProk workflows
The following modules are new additions to the core sample characterization performed on all organisms after genome assembly. While most of these are run by default, several modules can be enabled through the usage of a Boolean input parameter. More information about each tool can be found by clicking on the associated links.
- Gene Typing
  - PlasmidFinder - identifies plasmid replicon genes in total or partial sequenced isolates of bacteria (default)
  - Prokka - annotates bacterial genomes quickly and produces standards-compliant output files (default)
  - ResFinder - identifies acquired antimicrobial resistance genes in total or partial sequenced isolates of bacteria (optional; set call_resfinder to true to enable)
- Quality Control
  - BUSCO - “provide[s] a quantitative assessment of the completeness in terms of expected gene content” (default)
  - Mummer ANI - calculates Average Nucleotide Identity (ANI) using MUMmer and an ANI calculation script from Lee Katz (optional; set call_ani to true to enable)
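As a rough illustration of enabling the optional modules, the sketch below builds a Cromwell/Terra-style inputs JSON in Python. The input names call_resfinder and call_ani come from the list above; the workflow-name prefix theiaprok_illumina_pe and the helper optional_module_inputs are assumptions made for this example, so check the key names against the WDL you are running.

```python
import json

# Assumed workflow name used as the key prefix in the inputs JSON;
# confirm it against your own copy of the WDL.
WORKFLOW = "theiaprok_illumina_pe"


def optional_module_inputs(resfinder: bool = False, ani: bool = False) -> dict:
    """Return only the Boolean inputs needed to switch optional modules on."""
    inputs = {}
    if resfinder:
        inputs[f"{WORKFLOW}.call_resfinder"] = True
    if ani:
        inputs[f"{WORKFLOW}.call_ani"] = True
    return inputs


# Serialize to JSON; Python True becomes the JSON literal true.
inputs_json = json.dumps(optional_module_inputs(resfinder=True, ani=True), indent=2)
print(inputs_json)
```

Leaving both flags unset produces an empty dictionary, which matches the behavior described above: the optional modules stay off unless explicitly requested.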
SKESA as default assembler
Following extensive validation and analysis, we have switched the default assembler from SPAdes to SKESA. We observed that the more conservative assemblies generated with SKESA led to greater concordance with known epidemiological relationships downstream, while maintaining the ability to accurately characterize pathogen genomic data with respect to taxon prediction, serotyping, and AMR gene detection.
The SPAdes assembler can still be selected with an input variable: for TheiaProk_Illumina_PE, set shovill_pe.assembler to “spades”; for TheiaProk_Illumina_SE, set shovill_se.assembler to “spades”.
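The assembler override can be sketched in the same way. The input names shovill_pe.assembler and shovill_se.assembler are taken from the paragraph above; the workflow-name prefixes and the helper assembler_override are hypothetical and only illustrate how the key might be assembled for an inputs JSON.

```python
# Task-level input names from the release notes, keyed by an ASSUMED
# workflow name (the prefix used in a Cromwell/Terra-style inputs JSON).
ASSEMBLER_INPUT = {
    "theiaprok_illumina_pe": "shovill_pe.assembler",
    "theiaprok_illumina_se": "shovill_se.assembler",
}


def assembler_override(workflow: str, assembler: str = "spades") -> dict:
    """Build the single inputs-JSON entry that overrides the default assembler."""
    try:
        task_input = ASSEMBLER_INPUT[workflow]
    except KeyError:
        raise ValueError(f"unknown workflow: {workflow!r}") from None
    return {f"{workflow}.{task_input}": assembler}


print(assembler_override("theiaprok_illumina_pe"))
```

Omitting the entry altogether leaves the new SKESA default in place.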
New workflows
- TheiaProk_Illumina_SE - this workflow is equivalent to TheiaProk_Illumina_PE but is intended for Illumina single-end sequencing data; all modules are the same except that, where appropriate, single-end-specific versions and parameters are used.
- RASUSA - a workflow that will randomly subsample reads to a specified coverage using RASUSA.
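To make "subsample reads to a specified coverage" concrete, here is a minimal Python sketch of the general idea: shuffle the reads, then keep them until the retained bases reach coverage × genome size. It is an illustration under stated assumptions, not the RASUSA implementation or the workflow's code; the function name, its parameters, and the example read lengths are all hypothetical.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, coverage, seed=None):
    """Return indices of randomly chosen reads totalling roughly
    coverage x genome_size bases (all reads if there are too few)."""
    target_bases = genome_size * coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # random order, reproducible with a seed

    kept, total = [], 0
    for i in order:
        if total >= target_bases:
            break  # target depth reached; discard the remaining reads
        kept.append(i)
        total += read_lengths[i]
    return sorted(kept)


# Hypothetical data: 1,000 reads of 150 bp over a 10 kb genome (15x depth),
# subsampled down to about 5x.
lengths = [150] * 1000
kept = subsample_to_coverage(lengths, genome_size=10_000, coverage=5, seed=1)
print(len(kept), sum(lengths[i] for i in kept))
```

Because reads are kept whole, the retained depth slightly overshoots the target rather than matching it exactly.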
Other changes
- GitHub Actions for automated testing and continuous integration were added to the PHBG repository!
- The export_taxon_tables task now can handle extra large fastq files.
- Shovill parameters have been exposed so advanced users can select their own assemblers and customize assembly parameters to their heart’s content.
- kSNP3 distance matrices were previously completely unordered. These SNP matrices are now more ordered than they were before. These semi-ordered SNP matrices appear most often when multiple outbreak groups are included in a kSNP3 analysis. Future releases will include the addition of fully-ordered SNP matrices.
- The kSNP3 workflow now produces SNP distance matrices and phylogenetic trees generated using both pangenome and core genome analyses.
Log of PRs
- Fja export tt highmem dev by @frankambrosio3 in #119
- exposing shovill parameters by @sage-wright in #121
- Order SNP-Dists Matrix by @kevinlibuit in #125
- Add GitHub Actions to PHBG by @rpetit3 in #123
- Add RASUSA Task and Workflow File by @kevinlibuit in #122
- add TheiaProk_Illumina_SE workflow by @sage-wright in #124
- typo fix by @frankambrosio3 in #128
- adds ANI to theiaprok_illumina_pe wf by @kapsakcj in #126
- Add BUSCO by @sage-wright in #127
- Adds resfinder to theiaprok pe and se by @michellescribner in #130
- Adds prokka, plasmidfinder, ksnp3 core, wf_pangenome by @michellescribner in #129
- Change default assembler to skesa by @sage-wright in #132
- remove compression from alignment files by @michellescribner in #134
Full Changelog: v0.6.0...v1.0.0