v1.0.0

@sage-wright sage-wright released this 12 Aug 17:10
· 222 commits to main since this release
488e95d

PHBG v1.0.0 Release Notes

This major release introduces a stable and validated version of the TheiaProk workflow series.

This release also offers two new workflows (TheiaProk_Illumina_SE and RASUSA) and multiple organism-agnostic modules described in more detail below.


About TheiaProk

The TheiaProk workflows are for assembly and characterization of prokaryotic genomes, principally bacteria. All input reads go through steps in the core workflow for read trimming and assembly, quality assessment, species identification, and resistance gene identification. Sub-workflows further characterize some genomes, with activation of these processes dependent on the taxa identified.

TheiaProk currently has two forms: one for Illumina paired-end sequencing data (TheiaProk_Illumina_PE) and one for Illumina single-end sequencing data (TheiaProk_Illumina_SE). Future plans include workflows for other sequencing data types, such as Oxford Nanopore.


The following information describes the changes since v0.6.0.

New modules in the TheiaProk workflows

The following modules are new additions to the core sample characterization performed on all organisms after genome assembly. Most of them run by default; the others are optional and are enabled by setting a Boolean input parameter to true. More information about each tool can be found by clicking on the associated links.

  • Gene Typing
    • PlasmidFinder - identifies plasmid replicon genes in totally or partially sequenced bacterial isolates (default)
    • Prokka - annotates bacterial genomes quickly and produces standards-compliant output files (default)
    • ResFinder - identifies acquired antimicrobial resistance genes in totally or partially sequenced bacterial isolates (optional; set call_resfinder to true to enable)
  • Quality Control
    • BUSCO - “provide[s] a quantitative assessment of the completeness in terms of expected gene content” (default)
    • Mummer ANI - calculates Average Nucleotide Identity (ANI) using MUMmer and an ANI calculation script from Lee Katz (optional; set call_ani to true to enable)
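
As a rough illustration, the two optional modules could be switched on through a workflow inputs file. The sketch below builds such entries in Python. Only call_resfinder and call_ani come from these notes; the theiaprok_illumina_pe prefix is an assumption, so check the workflow's actual input names before using it.

```python
import json

# Hypothetical workflow prefix (an assumption, not taken from the release notes).
WORKFLOW = "theiaprok_illumina_pe"


def optional_module_inputs(resfinder=False, ani=False):
    """Return input entries for the optional modules named in these notes."""
    return {
        f"{WORKFLOW}.call_resfinder": resfinder,  # ResFinder is off unless set to true
        f"{WORKFLOW}.call_ani": ani,              # Mummer ANI is off unless set to true
    }


inputs = optional_module_inputs(resfinder=True, ani=True)
print(json.dumps(inputs, indent=2))
```

Modules marked (default) need no entry, because they run without one.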

SKESA as default assembler

After extensive validation and analysis, we have switched the default assembler from SPAdes to SKESA. We observed that the more conservative assemblies generated by SKESA led to greater concordance with known epidemiological relationships downstream, while still allowing accurate characterization of pathogen genomic data with respect to taxon prediction, serotyping, and AMR gene detection.

The SPAdes assembler can still be selected through an input variable: for TheiaProk_Illumina_PE, set shovill_pe.assembler to “spades”; for TheiaProk_Illumina_SE, set shovill_se.assembler to “spades”.
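
A minimal sketch of that override, written as a Python helper that produces the inputs entry for either workflow. The input names shovill_pe.assembler and shovill_se.assembler are taken from the text above; the helper name is hypothetical, and any workflow-level prefix a platform may require in front of these names is not shown.

```python
import json

# Input names as given in these release notes, keyed by read type.
ASSEMBLER_INPUT = {
    "PE": "shovill_pe.assembler",
    "SE": "shovill_se.assembler",
}


def assembler_override(read_type, assembler="spades"):
    """Return the inputs entry that selects an assembler for PE or SE data."""
    key = ASSEMBLER_INPUT[read_type.upper()]  # raises KeyError for unknown read types
    return {key: assembler}


pe_inputs = assembler_override("pe")
se_inputs = assembler_override("SE")
print(json.dumps({**pe_inputs, **se_inputs}, indent=2))
```

Leaving the entry out keeps the new default, SKESA.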

New workflows

  • TheiaProk_Illumina_SE - this workflow is equivalent to TheiaProk_Illumina_PE but is intended for Illumina single-end sequencing data; all modules are the same, except that single-end-specific versions and parameters are used where appropriate.
  • RASUSA - a workflow that will randomly subsample reads to a specified coverage using RASUSA.
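
To make "subsample to a specified coverage" concrete, here is a simplified Python sketch of the general idea: shuffle the reads, then keep them until their combined length reaches coverage × genome size. It is an illustration under that assumption, not RASUSA's actual implementation, and the function name and toy data are hypothetical.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, coverage, seed=None):
    """Randomly keep reads until their total bases reach coverage * genome_size.

    Returns the sorted indices of the reads that were kept.
    """
    target_bases = coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # seeded for reproducible subsampling

    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept)


# Toy data: 1,000 reads of 150 bp against a 10 kb genome (15x in total).
lengths = [150] * 1000
kept = subsample_to_coverage(lengths, genome_size=10_000, coverage=5, seed=42)
print(len(kept), "reads kept,", sum(lengths[i] for i in kept), "bases")
```

If the input holds fewer bases than the target, this sketch simply keeps every read.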

Other changes

  • GitHub Actions for automated testing and continuous integration were added to the PHBG repository!
  • The export_taxon_tables task can now handle extra-large FASTQ files.
  • Shovill parameters have been exposed so advanced users can select their own assemblers and customize assembly parameters to their heart’s content.
  • kSNP3 SNP distance matrices were previously completely unordered. They are now partially ordered; semi-ordered matrices appear most often when multiple outbreak groups are included in a kSNP3 analysis. Future releases will add fully ordered SNP matrices.
  • The kSNP3 workflow now produces SNP distance matrices and phylogenetic trees generated using both pangenome and core genome analyses.

Log of PRs

Full Changelog: v0.6.0...v1.0.0

Follow Theiagen on Twitter!