
Releases: theiagen/public_health_bacterial_genomics

Release notes for version 1.3.0

21 Apr 15:26
0e380f9

This minor release adds a new species-specific genomic characterization module for Vibrio spp.

New task

Vibrio spp.

One new task has been implemented for Vibrio spp. genomic characterization: Vibrio Characterization through SRST2 and a custom database.
Information for this new task is included in the latest TheiaProk documentation.

What's Changed

Full Changelog: v1.2.0...v.1.3.0

v1.2.0

28 Mar 14:11
384c1a0

This minor release implements several enhancements and improvements to the species-specific genomic characterization modules.

New tasks

Staphylococcus aureus

Three new tasks have been implemented for S. aureus genomic characterization: spatyper,
staphopia-sccmec and agrvate.

Neisseria spp.

  • Neisseria gonorrhoeae: a new task, ngmaster, has been implemented.

  • Neisseria meningitidis: a new task, meningotype, has been implemented.

Mycobacterium tuberculosis

The tbprofiler task now has an additional set of outputs that can be accessed by
setting the tbprofiler_additional_outputs option to true.

What's Changed

Full Changelog: v1.1.1...v1.2.0

Follow us on Twitter!

v1.1.1

31 Jan 19:44
42659de

This patch release implements several enhancements and improvements to the phylogenetic workflows.

For the kSNP3, Mashtree, and Core_Gene_SNP workflows, several changes have been implemented.

A new task, reorder_matrix, was created. It introduces the following changes:

  1. Final phylogenetic trees from these workflows are now midpoint-rooted to improve their appearance.
  2. Previously, SNP matrices were not ordered. Now, they are ordered to match the order of terminal ends in the midpoint-rooted phylogenetic tree.
  3. Phandango coloring is automatically applied to all column headers in matrices (:c1); these matrix files are .csv files for easy transfer/upload to Phandango.
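
The reordering and header-suffix steps can be illustrated with a minimal Python sketch. It is not the reorder_matrix task itself: the function names and the in-memory matrix layout are hypothetical, and the tip order is assumed to have been read from the midpoint-rooted tree already (rooting is not shown).

```python
import csv
import io


def reorder_matrix(matrix, tip_order):
    """Reorder a square SNP matrix so rows and columns follow the tree tips.

    matrix: dict of sample -> dict of sample -> SNP distance (hypothetical layout).
    tip_order: sample names in the order of the tips of the rooted tree.
    Returns a list of rows (lists of strings) ready to be written as CSV.
    """
    missing = [name for name in tip_order if name not in matrix]
    if missing:
        raise KeyError(f"samples missing from matrix: {missing}")
    # Every column header gets the ':c1' suffix mentioned in the notes.
    header = [""] + [f"{name}:c1" for name in tip_order]
    rows = [header]
    for row_name in tip_order:
        rows.append([row_name] + [str(matrix[row_name][col]) for col in tip_order])
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    snp_matrix = {
        "A": {"A": 0, "B": 3, "C": 5},
        "B": {"A": 3, "B": 0, "C": 2},
        "C": {"A": 5, "B": 2, "C": 0},
    }
    print(to_csv(reorder_matrix(snp_matrix, ["C", "A", "B"])))
```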

A new task, summarize_data, was created. It does the following:

  1. Digests a comma-separated list of column names
  2. Parses through those column contents
  3. Outputs a .csv file that indicates presence (TRUE)/absence (empty cell) for each item in those columns.
  4. A Boolean option, phandango_coloring, will color all items from the same column in the same format; rows are ordered according to the terminal ends of the midpoint-rooted tree for easy transfer/upload to Phandango.
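
Steps 1 to 3 can be sketched in Python as follows. This is an illustration only, not the summarize_data task: the function signature, the "sample" identifier column, the example column names, and the assumption that items within a cell are comma-separated are all hypothetical. The phandango_coloring option and tree-based row ordering are not covered.

```python
def summarize_data(records, column_names, id_column="sample", delimiter=","):
    """Build a presence/absence table from selected columns.

    records: list of dicts, one per sample (hypothetical input layout).
    column_names: comma-separated list of columns to digest.
    Returns rows where a cell is "TRUE" if the item is present, else "".
    """
    columns = [name.strip() for name in column_names.split(",") if name.strip()]
    items = []        # distinct items in first-seen order
    per_sample = {}   # sample id -> set of items found
    for record in records:
        found = set()
        for column in columns:
            cell = record.get(column) or ""
            for item in cell.split(delimiter):
                item = item.strip()
                if item:
                    found.add(item)
                    if item not in items:
                        items.append(item)
        per_sample[record[id_column]] = found
    rows = [[id_column] + items]
    for sample, found in per_sample.items():
        rows.append([sample] + ["TRUE" if item in found else "" for item in items])
    return rows


if __name__ == "__main__":
    table = [
        {"sample": "s1", "amr_genes": "blaTEM, tetA", "plasmids": "IncF"},
        {"sample": "s2", "amr_genes": "tetA", "plasmids": ""},
    ]
    for row in summarize_data(table, "amr_genes,plasmids"):
        print(",".join(row))
```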

These two tasks have been added to all three phylogenetic workflows in the PHBG repository.

Other modifications

ShigEiFinder

A new optional task was created that allows ShigEiFinder to be run with read files as inputs instead of assemblies. Ten new output columns were created; they mirror the outputs of the assembly-based task but carry the _reads suffix to differentiate between them. To use this task, set the new optional input variable call_shigeifinder_reads_input to true. This task is not run by default.

AMRFinderPlus

A typo has been corrected in the AMRFinderPlus task; the previous "pnemoniae" has now been corrected to "pneumoniae".

New TheiaProk QC

Several new output columns report the following:

  • r1, r2, and combined raw mean quality scores and read lengths
  • combined clean mean quality scores and read lengths (no individual r1 and r2 to avoid excessive column creep)

Clean mean quality scores and read lengths are now able to be checked in the qc_check task as well.

What's Changed

Full Changelog: v1.1.0...v1.1.1


v1.1.0

29 Dec 21:14
870ae7f

PHBG v1.1.0 Release Notes

This minor release introduces multiple modules to the TheiaProk workflow series as well as a new workflow for performing core gene phylogenetic analysis (Core_Gene_SNP).

Updates to the TheiaProk Workflow Series

Taxon-specific modules added:

  • Acinetobacter baumannii: Kaptive (detection of surface polysaccharide loci for A. baumannii) & AcinetobacterPlasmid Typing (plasmid typing of A. baumannii using abricate with the custom A. baumannii plasmid typing database)
  • Pseudomonas aeruginosa: Pasty (tool to identify the serogroup of P. aeruginosa isolates)
  • Shigella spp.: ShigaTyper (tool designed to determine Shigella serotype), ShigEiFinder (tool that is used to differentiate Shigella/EIEC using cluster-specific genes and identify the serotype using O-antigen/H-antigen genes), SonneiTyper (tool to identify input genomes as S. sonnei and assign those identified as S. sonnei to hierarchical genotypes based on detection of single nucleotide variants)
  • Streptococcus pneumoniae: GPS unified workflow, comprising PopPUNK with Global Pneumococcal Sequencing (GPS) database v6 (for GPS Cluster assignment), SeroBA (tool for S. pneumoniae serotyping), and PBPTyper (tool for in silico Penicillin Binding Protein (PBP) typing)

QC and read processing modules added:

  • Option to quantify secondary genus abundance using MIDAS
  • Option to utilize fastp rather than trimmomatic for read processing
  • Option to utilize bakta rather than prokka for genome annotation
  • Option to perform a QC check--i.e. determine QC Pass or QC Alert based on user-defined thresholds for multiple QC metrics
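
The threshold-based QC check can be pictured with a short Python sketch. It is an assumption-laden illustration, not the workflow's qc_check task: the function signature, the threshold values, and the exact "QC_PASS"/"QC_ALERT" strings are hypothetical.

```python
def qc_check(metrics, thresholds):
    """Compare QC metrics against user-defined (minimum, maximum) thresholds.

    metrics: dict of metric name -> observed value.
    thresholds: dict of metric name -> (minimum or None, maximum or None).
    Returns "QC_PASS" or "QC_ALERT: <reasons>" (hypothetical output format).
    """
    failures = []
    for name, (minimum, maximum) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name} missing")
            continue
        if minimum is not None and value < minimum:
            failures.append(f"{name} {value} below minimum {minimum}")
        if maximum is not None and value > maximum:
            failures.append(f"{name} {value} above maximum {maximum}")
    if failures:
        return "QC_ALERT: " + "; ".join(failures)
    return "QC_PASS"


if __name__ == "__main__":
    # Threshold values below are made up for the example.
    limits = {
        "est_coverage_clean": (40, None),
        "assembly_length": (4_000_000, 6_000_000),
    }
    print(qc_check({"est_coverage_clean": 52.3, "assembly_length": 5_100_000}, limits))
    print(qc_check({"est_coverage_clean": 18.0, "assembly_length": 5_100_000}, limits))
```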

Column output updates:

  • genome_length renamed to assembly_length
  • est_coverage renamed to est_coverage_raw (est_coverage_clean column output added)
    • Note: Assembly length calculated by quast is used to calculate estimated coverage rather than the estimated genome length produced from the mash sketch
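
As a rough illustration of the note above, estimated coverage can be thought of as sequenced bases divided by assembly length. The sketch below assumes that simple ratio; the workflow's exact calculation is not given in these notes, and the function name and example numbers are hypothetical.

```python
def estimated_coverage(total_bases, assembly_length):
    """Estimated depth of coverage: sequenced bases per assembled base.

    total_bases: number of bases in the read set (raw or clean).
    assembly_length: assembly length in bases, e.g. as reported by quast.
    """
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return total_bases / assembly_length


if __name__ == "__main__":
    assembly_length = 5_000_000  # made-up example value
    est_coverage_raw = estimated_coverage(250_000_000, assembly_length)
    est_coverage_clean = estimated_coverage(230_000_000, assembly_length)
    print(est_coverage_raw, est_coverage_clean)
```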

Core Gene SNP Workflow

The Core_Gene_SNP workflow is a flexible workflow intended for core gene alignment and phylogenetic analysis of a set of samples. The workflow takes in gene sequence data in GFF3 format from a set of samples. It first produces a pangenome summary using Pirate, which clusters genes within the sample set into orthologous gene families. By default, the workflow also instructs Pirate to produce both core genome and pangenome alignments.

The workflow subsequently triggers the generation of a SNP distance matrix and a phylogenetic tree using the core genome alignment via snp-dists and iqtree, respectively. Optionally, the workflow will also run this analysis using the pangenome alignment.
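
To show what a SNP distance matrix contains, here is a minimal Python sketch that counts pairwise differences in an alignment. It is not snp-dists: the function names are hypothetical, and the choice to count only positions where both sequences carry A, C, G, or T is an assumption made for this example.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a gap or ambiguous base are skipped
    (an assumption for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in VALID_BASES and b in VALID_BASES
    )


def distance_matrix(alignment):
    """Build a symmetric matrix: sample -> sample -> SNP distance."""
    names = list(alignment)
    matrix = {row: {col: 0 for col in names} for row in names}
    for a, b in combinations(names, 2):
        distance = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = distance
        matrix[b][a] = distance
    return matrix


if __name__ == "__main__":
    core_alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AC-TTG"}
    for name, row in distance_matrix(core_alignment).items():
        print(name, row)
```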

Other Modifications

  • AMRFinderPlus task modifications:
    • Default docker image updated to v3.10.26; the database version is now output
    • Drug class outputs brought to Terra data table
  • kSNP3 task/workflow modifications
    • tree Newick file output extensions changed to .nwk
  • Gambit docker task modified to utilize GAMBIT v0.5.0
  • TS_MLST task modified to utilize MLST v2.23.0

New Documentation

Detailed documentation has been created for all workflows in the PHBG v1.1.0 repository.


What's Changed

New Contributors

Full Changelog: v1.0.0...1.1.0

v1.0.0

12 Aug 17:10
488e95d

PHBG v1.0.0 Release Notes

This major release introduces a stable and validated version of the TheiaProk workflow series.

This release also offers two new workflows (TheiaProk_Illumina_SE and RASUSA) and multiple organism-agnostic modules described in more detail below.


About TheiaProk

The TheiaProk workflows are for assembly and characterization of prokaryotic genomes, principally bacteria. All input reads go through steps in the core workflow for read trimming and assembly, quality assessment, species identification, and resistance gene identification. Sub-workflows further characterize some genomes, with activation of these processes dependent on the taxa identified.

TheiaProk_Illumina_PE

Currently, TheiaProk has two forms: for Illumina paired-end sequencing data (TheiaProk_Illumina_PE), and for Illumina single-end sequencing data (TheiaProk_Illumina_SE). Future plans include development of workflows for alternative sequence data types, like Oxford Nanopore.


The following information describes the changes since the v0.6.0 version.

New modules to the TheiaProk workflows

The following modules are new additions to the core sample characterization performed on all organisms after genome assembly. While most of these are run by default, several modules can be enabled through the usage of a Boolean input parameter. More information about each tool can be found by clicking on the associated links.

  • Gene Typing
    • PlasmidFinder - identifies plasmid replicon genes in total or partial sequenced isolates of bacteria (default)
    • Prokka - annotates bacterial genomes quickly and produces standards-compliant output files (default)
    • ResFinder - identifies acquired antimicrobial resistance genes in total or partial sequenced isolates of bacteria (optional; set call_resfinder to true to enable)
  • Quality Control
    • BUSCO - “provide[s] a quantitative assessment of the completeness in terms of expected gene content” (default)
    • Mummer ANI - calculates Average Nucleotide Identity (ANI) using MUMmer and an ANI calculation script from Lee Katz (optional; set call_ani to true to enable)

SKESA as default assembler

After extensive validation and analysis, we have switched our default assembler from SPAdes to SKESA. We have observed that the more conservative assemblies generated with SKESA led to greater concordance with known epidemiological relationships downstream while maintaining an ability to accurately characterize pathogen genomic data with respect to taxon prediction, serotyping, and AMR gene detection.

The SPAdes assembler can still be used through usage of an input variable (for TheiaProk_Illumina_PE, set shovill_pe.assembler to “spades”; for TheiaProk_Illumina_SE, set shovill_se.assembler to “spades”).
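
A small Python sketch of how such an input could be written into a workflow inputs JSON. The task-level names shovill_pe.assembler and shovill_se.assembler come from the text above; the workflow-level prefixes used in the keys and the helper name are assumptions, so check the actual input names in your workspace.

```python
import json

# Task name per workflow, as given in the release notes.
# The lowercase workflow prefixes are an assumption for this example.
SHOVILL_TASK = {
    "theiaprok_illumina_pe": "shovill_pe",
    "theiaprok_illumina_se": "shovill_se",
}


def assembler_inputs(workflow, assembler="spades"):
    """Return an inputs mapping that overrides the default assembler."""
    task = SHOVILL_TASK[workflow]
    return {f"{workflow}.{task}.assembler": assembler}


if __name__ == "__main__":
    print(json.dumps(assembler_inputs("theiaprok_illumina_pe"), indent=2))
```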

New workflows

  • TheiaProk_Illumina_SE - this workflow is equivalent to TheiaProk_Illumina_PE but is intended for Illumina single-end sequencing data; all modules are the same, except when appropriate, single-end-specific versions and parameters are used.
  • RASUSA - a workflow that will randomly subsample reads to a specified coverage using RASUSA.
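
The idea of subsampling reads to a specified coverage can be sketched in Python as below. This is a simplified illustration, not RASUSA's implementation: the function name is hypothetical, reads are plain strings held in memory, and the genome size is assumed to be known.

```python
import random


def subsample_reads(reads, genome_size, coverage, seed=None):
    """Randomly keep reads until roughly genome_size * coverage bases are kept.

    reads: list of read sequences (strings).
    Returns the kept reads in their original order. If the input holds fewer
    bases than the target, all reads are returned.
    """
    target_bases = genome_size * coverage
    rng = random.Random(seed)
    indices = list(range(len(reads)))
    rng.shuffle(indices)
    kept = []
    total_bases = 0
    for index in indices:
        if total_bases >= target_bases:
            break
        kept.append(index)
        total_bases += len(reads[index])
    kept.sort()  # restore the original read order
    return [reads[i] for i in kept]


if __name__ == "__main__":
    example_reads = ["ACGT" * 25] * 100  # 100 reads of 100 bp each
    subset = subsample_reads(example_reads, genome_size=1000, coverage=5, seed=1)
    print(len(subset), sum(len(read) for read in subset))
```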

Other changes

  • GitHub Actions for automated testing and continuous integration were added to the PHBG repository!
  • The export_taxon_tables task now can handle extra large fastq files.
  • Shovill parameters have been exposed so advanced users can select their own assemblers and customize assembly parameters to their heart’s content.
  • kSNP3 distance matrices were previously completely unordered. These SNP matrices are now more ordered than they were before. These semi-ordered SNP matrices appear most often when multiple outbreak groups are included in a kSNP3 analysis. Future releases will include the addition of fully-ordered SNP matrices.
  • The kSNP3 workflow now produces SNP distance matrices and phylogenetic trees generated using both pangenome and core genome analyses.

Log of PRs

Full Changelog: v0.6.0...v1.0.0


v0.6.0

30 Jun 15:42
cb0b9c2

What's Changed

Full Changelog: v0.5.0...v0.6.0

v0.5.0

12 May 18:22
43f3c88

What's Changed

  • Add 2 kraken2 workflows (Single End & Paired End) by @rpetit3 in #70
  • New NCBI-AMRFinderPlus workflow and integration in TheiaProk_Illumina_PE wf by @kapsakcj in #65
    • TheiaProk_Illumina_PE workflow - replaced abricate task with NCBI-AMRFinderPlus for AMR gene detection
    • Fixed integer math in read_screen task by @sage-wright, also in #65
  • mlst task updated to 2.22.0 (default docker image updated to staphb/mlst:2.22.0)
  • updated gambit_query workflow with updated task (gambit v0.4.0) by @kevinlibuit
  • export_taxon_tables feature now includes NCBI-AMRFinderPlus outputs

Full Changelog: v0.4.0...v0.5.0

v0.4.0

08 Apr 20:54
53d351a

This release adds MLST profiling to the TheiaProk_Illumina_PE workflow.

Additional updates to TheiaProk_Illumina_PE:

  • Data screening task added to avoid workflow failures caused by low-quality input read data
  • QC metrics adjusted for WGS bacterial data
  • Capture of n50 from Quast report (Thanks, @erikwolfsohn!)
  • Minimum percent length and coverage parameters exposed in Abricate task
  • Replacing the Quast assembly length with the Mash estimated genome size for the cg-pipeline read coverage calculations
  • Allow for additional fields of metadata to be exported to taxon tables: collection_date, originating_lab, city, county, zip

v0.3.0

10 Mar 00:02
a0b4c33

This release renames the Apollo_Illumina_PE workflow to TheiaProk_Illumina_PE and restructures the PHBG task directory.

The TheiaProk workflow was developed to replace Apollo workflows for bacterial genomic characterization. TheiaProk is based off of @rpetit3's Bactopia and its Merlin subworkflow and differs from the original Apollo workflows in its organism-typing subworkflow merlin_magic.

This subworkflow triggers organism typing based on the gambit taxon assignment for each sample; e.g. serotyping via SerotypeFinder will be performed for samples with an Escherichia gambit taxon assignment.
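
The taxon-triggered dispatch can be illustrated with a short Python sketch. It is not the merlin_magic subworkflow: the lookup table only lists tools named elsewhere in these release notes, and the function name and prefix-matching rule are assumptions made for the example.

```python
# Taxon -> typing tools, limited to pairings mentioned in these release notes.
TYPING_TOOLS = {
    "Escherichia": ["SerotypeFinder"],
    "Staphylococcus aureus": ["spatyper", "staphopia-sccmec", "agrvate"],
    "Neisseria gonorrhoeae": ["ngmaster"],
    "Neisseria meningitidis": ["meningotype"],
}


def select_typing_tools(gambit_taxon):
    """Return the typing tools to run for a predicted taxon.

    A table entry matches when it equals the taxon or is a leading part of it,
    so a genus-level entry also covers its species (assumed matching rule).
    """
    tools = []
    for taxon, names in TYPING_TOOLS.items():
        if gambit_taxon == taxon or gambit_taxon.startswith(taxon + " "):
            tools.extend(names)
    return tools


if __name__ == "__main__":
    for taxon in ("Escherichia coli", "Neisseria meningitidis", "Bacillus subtilis"):
        print(taxon, "->", select_typing_tools(taxon))
```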

TheiaProk organism typing will be performed for the following organisms using the listed bioinformatics software:

TheiaProk_Illumina_PE will also perform AMR gene detection using abricate against the NCBI AMRFinderPlus database.

Additionally, the PHBG directory structure was reformatted for ease of use and readability.

v0.2

27 Jul 15:08
ec4d057

Release to add the Kleborate and SerotypeFinder workflows

  • Available as tasks within the task_taxon_id.wdl file as well as stand-alone single-task workflows; both available on Terra via DockStore

Other Changes:

  • Version and analysis date captured for every workflow
  • Shovill task modified to include optional minimum contig length (default set to 200bp); this default setting is utilized in the Apollo_Illumina_PE workflow
  • White space inconsistencies addressed
  • Apollo_Illumina_PE output name changes:
    • predicted_genus → gambit_genus
    • predicted_species → gambit_species
  • Validation files directory created for local testing