Releases: theiagen/public_health_bacterial_genomics
Release notes for version 1.3.0
This minor release adds a new species-specific genomic characterization module for Vibrio spp.
New task
Vibrio spp.
One new task has been implemented for Vibrio spp. genomic characterization: Vibrio Characterization through SRST2 and a custom database.
Information for this new task is included in the latest TheiaProk documentation.
What's Changed
- Incorporate vibrio characterisation with srst2 into TheiaProk workflows by @cimendes in #216
- Amendment to Vibrio subworkflow by @emmadoughty in #228
- update version task to "PHBG v1.3.0" by @kapsakcj in #229
- Updated task description by @emmadoughty in #230
Full Changelog: v1.2.0...v1.3.0
v1.2.0
This minor release implements several enhancements and improvements to the species-specific genomic characterization modules
New tasks
Staphylococcus aureus
Three new tasks have been implemented for S. aureus genomic characterization: spatyper, staphopia-sccmec and agrvate.
Neisseria spp.
- Neisseria gonorrhoeae: a new task, ngmaster, has been implemented.
- Neisseria meningitidis: a new task, meningotype, has been implemented.
Mycobacterium tuberculosis
The tbprofiler task now has an additional set of outputs that can be accessed by setting the tbprofiler_additional_outputs option to true.
What's Changed
- mlst: new String output "ts_mlst_allelic_profile" by @kapsakcj in #209
- Add neisseria subwf: ngmaster and meningotype by @kapsakcj in #211
- New output columns from TBProfiler Task by @cimendes in #217
- Adds Staph aureus subwf by @kapsakcj in #213
- Update README.md by @kevinlibuit in #220
- Add comma by @sage-wright in #222
- add missing tbprofiler optional outputs to export_taxon_tables inputs by @cimendes in #224
- Update version to v1.2.0 by @sage-wright in #223
Full Changelog: v1.1.1...v1.2.0
v1.1.1
This patch release implements several enhancements and improvements to the phylogenetic workflows.
For the kSNP3, Mashtree, and Core_Gene_SNP workflows, several changes have been implemented.
A new task, reorder_matrix, was created that performs the following:
- Final phylogenetic trees from these workflows are now midpoint-rooted to improve appearance.
- Previously, SNP matrices were not ordered. Now, they are ordered to match the order of terminal ends in the midpoint-rooted phylogenetic tree.
- Phandango coloring is automatically applied to all column headers in matrices (:c1); these matrix files are .csv files for easy transfer/upload to Phandango.
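The reordering step can be illustrated with a minimal Python sketch. It is not the reorder_matrix task itself: the function names, the toy tree and the toy matrix are hypothetical, and the tip-order parser assumes a simple Newick string with unquoted labels.

```python
import re


def newick_tip_order(newick: str) -> list[str]:
    """Return leaf names in the order they appear in a simple Newick string.

    Assumption: labels are unquoted, so a leaf is any name that directly
    follows '(' or ','. Internal node labels follow ')' and are skipped.
    """
    names = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in names if name.strip()]


def reorder_matrix(matrix: dict[str, dict[str, int]], order: list[str]) -> list[list[str]]:
    """Build CSV-ready rows whose rows and columns follow the tree tip order."""
    header = [""] + order
    rows = [[row] + [str(matrix[row][col]) for col in order] for row in order]
    return [header] + rows


# Toy inputs (hypothetical sample names and distances).
tree = "((B:1,C:1):0.5,A:2);"
snps = {
    "A": {"A": 0, "B": 7, "C": 9},
    "B": {"A": 7, "B": 0, "C": 2},
    "C": {"A": 9, "B": 2, "C": 0},
}

tips = newick_tip_order(tree)
table = reorder_matrix(snps, tips)
```

Writing `table` with `csv.writer` would give a matrix whose row and column order matches the tree, which is the property the notes describe.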
A new task, summarize_data, was created that performs the following:
- Digests a comma-separated list of column names
- Parses through those column contents
- Outputs a .csv file that indicates presence (TRUE)/absence (empty cell) for each item in those columns.
- A Boolean option, phandango_coloring, will color all items from the same column in the same format; rows are ordered according to the terminal ends of the midpoint-rooted tree for easy transfer/upload to Phandango.
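As a rough illustration of the presence/absence table, here is a minimal Python sketch. It is not the summarize_data task: the function name, the sample data and the assumption that each cell holds comma-separated items are all hypothetical.

```python
def summarize(rows: list[dict[str, str]], id_col: str, columns: str) -> list[list[str]]:
    """Build a presence (TRUE) / absence (empty cell) table.

    rows    -- one dict per sample (e.g. read from a data table)
    id_col  -- name of the column holding the sample identifier
    columns -- comma-separated list of column names to digest
    Assumption: each selected cell holds a comma-separated list of items.
    """
    wanted = [c.strip() for c in columns.split(",") if c.strip()]
    items: list[str] = []               # every distinct item, in first-seen order
    per_sample: dict[str, set[str]] = {}
    for row in rows:
        found = set()
        for col in wanted:
            for item in (row.get(col) or "").split(","):
                item = item.strip()
                if item:
                    found.add(item)
                    if item not in items:
                        items.append(item)
        per_sample[row[id_col]] = found

    table = [[id_col] + items]
    for sample, found in per_sample.items():
        table.append([sample] + ["TRUE" if item in found else "" for item in items])
    return table


# Hypothetical input rows.
rows = [
    {"sample": "s1", "amr_genes": "blaKPC-2, tet(A)", "plasmids": "IncF"},
    {"sample": "s2", "amr_genes": "tet(A)", "plasmids": ""},
]
summary = summarize(rows, "sample", "amr_genes,plasmids")
```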
These two tasks have been added to all three phylogenetic workflows in the PHBG repository.
Other modifications
ShigEiFinder
A new optional task was created that allows ShigEiFinder to be run with read files as inputs instead of assemblies. Ten new output columns were created; they are identical to those of the task that uses assemblies as input, except that they carry the _reads suffix to differentiate them. To use this task, set the new optional input variable call_shigeifinder_reads_input to true. This task is not run by default.
AMRFinderPlus
A typo has been corrected in the AMRFinderPlus task; the previous "pnemoniae" has now been corrected to "pneumoniae".
New TheiaProk QC
Several new columns are now being outputted that report the following:
- r1, r2, and combined raw mean quality scores and read lengths
- combined clean mean quality scores and read lengths (no individual r1 and r2 to avoid excessive column creep)
Clean mean quality scores and read lengths can now be checked in the qc_check task as well.
What's Changed
- Update task_amrfinderplus.wdl by @kevinlibuit in #204
- add optional task for shigeifinder w/ reads as input; update default docker for both shigeifinder & shigatyper by @kapsakcj in #202
- Fja readlength dev by @frankambrosio3 in #201
- reorder snp matrix by @sage-wright in #198
Full Changelog: v1.1.0...v1.1.1
v1.1.0
PHBG v1.1.0 Release Notes
This minor release introduces multiple modules to the TheiaProk workflow series as well as a new workflow for performing core gene phylogenetic analysis (Core_Gene_SNP).
Updates to the TheiaProk Workflow Series
Taxon-specific modules added:
- Acinetobacter baumannii: Kaptive (detection of surface polysaccharide loci for A. baumannii) & AcinetobacterPlasmid Typing (plasmid typing of A. baumannii using abricate with the custom A. baumannii plasmid typing database)
- Pseudomonas aeruginosa: Pasty (tool to identify the serogroup of P. aeruginosa isolates)
- Shigella spp.: ShigaTyper (tool designed to determine Shigella serotype), ShigEiFinder (tool that is used to differentiate Shigella/EIEC using cluster-specific genes and identify the serotype using O-antigen/H-antigen genes), SonneiTyper (tool to identify input genomes as S. sonnei and assign those identified as S. sonnei to hierarchical genotypes based on detection of single nucleotide variants)
- Streptococcus pneumoniae: GPS unified workflow, comprising PopPUNK with the Global Pneumococcal Sequencing (GPS) database v6 (for GPS Cluster assignment), SeroBA (tool for S. pneumoniae serotyping), and PBPTyper (tool for in silico Penicillin Binding Protein (PBP) typing)
QC and read processing modules added:
- Option to quantify secondary genus abundance using MIDAS
- Option to utilize fastp rather than trimmomatic for read processing
- Option to utilize bakta rather than prokka for genome annotation
- Option to perform a QC check--i.e. determine QC Pass or QC Alert based on user-defined thresholds for multiple QC metrics
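The threshold-based QC check can be pictured with a minimal Python sketch. It is not the qc_check task: the function name, the threshold format and the "QC_PASS"/"QC_ALERT" strings are assumptions for illustration, and the threshold values are made up.

```python
from typing import Optional

# (minimum, maximum) per metric; None means "no bound on that side".
Thresholds = dict[str, tuple[Optional[float], Optional[float]]]


def qc_check(metrics: dict[str, float], thresholds: Thresholds) -> str:
    """Return "QC_PASS", or "QC_ALERT: ..." listing every failed metric."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name} missing")
        elif low is not None and value < low:
            alerts.append(f"{name} below {low}")
        elif high is not None and value > high:
            alerts.append(f"{name} above {high}")
    return "QC_ALERT: " + "; ".join(alerts) if alerts else "QC_PASS"


# Hypothetical user-defined thresholds for two of the output columns.
thresholds: Thresholds = {
    "est_coverage_clean": (40, None),
    "assembly_length": (4_500_000, 5_500_000),
}

good = qc_check({"est_coverage_clean": 55.0, "assembly_length": 5_000_000}, thresholds)
bad = qc_check({"est_coverage_clean": 20.0, "assembly_length": 5_000_000}, thresholds)
```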
Column output updates:
- genome_length renamed to assembly_length
- est_coverage renamed to est_coverage_raw (est_coverage_clean column output added)
Core Gene SNP Workflow
The Core_Gene_SNP workflow is a flexible workflow intended for core gene alignment and phylogenetic analysis of a set of samples. The workflow takes in gene sequence data in GFF3 format from a set of samples. It first produces a pangenome summary using Pirate, which clusters genes within the sample set into orthologous gene families. By default, the workflow also instructs Pirate to produce both core genome and pangenome alignments.
The workflow subsequently triggers the generation of a SNP distance matrix and a phylogenetic tree using the core genome alignment via snp-dists and iqtree, respectively. Optionally, the workflow will also run this analysis using the pangenome alignment.
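To make the SNP distance matrix concrete, here is a minimal Python sketch of pairwise SNP counting over an alignment. It is not snp-dists itself: the function names and toy sequences are hypothetical, and the rule of counting only positions where both sequences carry A, C, G or T is an assumption made for the example.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both bases are A/C/G/T and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def distance_matrix(alignment: dict[str, str]) -> dict[str, dict[str, int]]:
    """Return a symmetric matrix of pairwise SNP distances."""
    names = list(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist = snp_distance(alignment[n], alignment[m])
        matrix[n][m] = matrix[m][n] = dist
    return matrix


# Toy core genome alignment (hypothetical samples).
alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACCTAC-A",
}
dists = distance_matrix(alignment)
```

In this toy example the gap in s3 is skipped rather than counted as a difference, so s1 and s3 differ at two positions.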
Other Modifications
- AMRFinderPlus task modifications:
  - Default docker image updated to v3.10.26 and output database version
  - Drug class outputs brought to Terra data table
- kSNP3 task/workflow modifications
  - tree Newick file output extensions changed to .nwk
- Gambit docker task modified to utilize GAMBIT v0.5.0
- TS_MLST task modified to utilize MLST v2.23.0
New Documentation
Detailed documentation has been created for all workflows in the PHBG v1.1.0 repository.
What's Changed
- amrfinderplus task updates by @kapsakcj in #137
- Add Streptococcus pneumoniae subworkflow by @kapsakcj in #141
- Adds subworkflow for A. baumannii, includes Kaptive task (K & O typing) by @erikwolfsohn in #138
- Kleborate updates by @kapsakcj in #148
- kSNP3 task edit: changed file suffix from .tree to .nwk by @kapsakcj in #146
- Adds drug class output to TheiaProk by @michellescribner in #145
- update gambit task to v0.5.0 docker image by @michellescribner in #151
- Spneumo subworkflow enhancements: docker & GPS db version outputs and upgrade default pbptyper docker by @kapsakcj in #149
- Add midas as optional TheiaProk task by @michellescribner in #159
- Add option to hide point mutations from AMRFinderPlus output & update default amrfinderplus docker image by @michellescribner in #158
- Fix gambit parsing for next_taxon_rank is None by @michellescribner in #161
- add task for Abaum plasmid typing to TheiaProk_Illumina_PE and SE by @kapsakcj in #160
- Add option to kSNP3 to create maximum likelihood and neighbor joining trees by @michellescribner in #166
- update default mlst docker image to staphb/mlst:2.23.0 & fix CI env by @kapsakcj in #163
- Modify midas parsing by @michellescribner in #172
- Adds shigella subworkflow by @kapsakcj in #162
- Adds bakta task by @michellescribner in #170
- Add fastp task, modify read trimming parameters, and modify estimated coverage calculations by @michellescribner in #169
- Fja tbprofiler update by @frankambrosio3 in #174
- Add Core_Gene_SNP workflow by @michellescribner in #178
- adds p. aeruginosa subworkflow and pasty for serogrouping by @jrotieno in #179
- update pasty_docker default; add pasty_comment string output for PE and SE wfs by @kapsakcj in #181
- Revert default read trimming parameters to v1.0 by @michellescribner in #184
- Eld docs dev by @emmadoughty in #180
- Fixed printf to convert sci notation to integers by @frankambrosio3 in #177
- Add qc_check task to TheiaProk by @michellescribner in #182
- Generate gene_presence_absence.csv with pirate task by @HNHalstead in #185
- MLST novel alleles by @emmadoughty in #186
- Export Taxon Table Fix and others by @sage-wright in #188
- fix file extension awareness cg_pipeline by @michellescribner in #189
New Contributors
- @jrotieno made their first contribution in #179
- @emmadoughty made their first contribution in #180
- @HNHalstead made their first contribution in #185
Full Changelog: v1.0.0...v1.1.0
v1.0.0
PHBG v1.0.0 Release Notes
This major release introduces a stable and validated version of the TheiaProk workflow series.
This release also offers two new workflows (TheiaProk_Illumina_SE and RASUSA) and multiple organism-agnostic modules described in more detail below.
About TheiaProk
The TheiaProk workflows are for assembly and characterization of prokaryotic genomes, principally bacteria. All input reads go through steps in the core workflow for read trimming and assembly, quality assessment, species identification, and resistance gene identification. Sub-workflows further characterize some genomes, with activation of these processes dependent on the taxa identified.
Currently, TheiaProk has two forms: for Illumina paired-end sequencing data (TheiaProk_Illumina_PE), and for Illumina single-end sequencing data (TheiaProk_Illumina_SE). Future plans include development of workflows for alternative sequence data types, like Oxford Nanopore.
The following information describes the changes since the v0.6.0 version.
New modules to the TheiaProk workflows
The following modules are new additions to the core sample characterization performed on all organisms after genome assembly. While most of these are run by default, several modules can be enabled through the usage of a Boolean input parameter. More information about each tool can be found by clicking on the associated links.
- Gene Typing
- PlasmidFinder - identifies plasmid replicon genes in total or partial sequenced isolates of bacteria (default)
- Prokka - annotates bacterial genomes quickly and produces standards-compliant output files (default)
- ResFinder - identifies acquired antimicrobial resistance genes in total or partial sequenced isolates of bacteria (optional; set call_resfinder to true to enable)
- Quality Control
- BUSCO - “provide[s] a quantitative assessment of the completeness in terms of expected gene content” (default)
- Mummer ANI - calculates Average Nucleotide Identity (ANI) using MUMmer and an ANI calculation script from Lee Katz (optional; set call_ani to true to enable)
SKESA as default assembler
Through extensive validation and analysis, we have made the decision to switch our default assembler from SPAdes to SKESA. We have observed that the more conservative assemblies generated with SKESA led to greater concordance with known epidemiological relationships downstream while maintaining an ability to accurately characterize pathogen genomic data with respect to taxon prediction, serotyping, and AMR gene detection.
The SPAdes assembler can still be used through an input variable (for TheiaProk_Illumina_PE, set shovill_pe.assembler to “spades”; for TheiaProk_Illumina_SE, set shovill_se.assembler to “spades”).
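For users who prepare an inputs JSON by hand, a minimal Python sketch of the override might look like the following. The helper name and the workflow prefixes (theiaprok_illumina_pe, theiaprok_illumina_se) are assumptions; only the shovill_pe.assembler and shovill_se.assembler names come from the notes above, so check the workflow's actual input names before use.

```python
import json

# Assumed mapping from workflow name to its Shovill task call name.
# The workflow names are hypothetical; the task names come from the notes.
SHOVILL_TASK = {
    "theiaprok_illumina_pe": "shovill_pe",
    "theiaprok_illumina_se": "shovill_se",
}


def assembler_inputs(workflow: str, assembler: str = "spades") -> dict[str, str]:
    """Return a one-entry inputs mapping that overrides the default assembler."""
    task = SHOVILL_TASK[workflow]
    return {f"{workflow}.{task}.assembler": assembler}


pe_inputs = assembler_inputs("theiaprok_illumina_pe")
print(json.dumps(pe_inputs, indent=2))
```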
New workflows
- TheiaProk_Illumina_SE - this workflow is equivalent to TheiaProk_Illumina_PE but is intended for Illumina single-end sequencing data; all modules are the same, except when appropriate, single-end-specific versions and parameters are used.
- RASUSA - a workflow that will randomly subsample reads to a specified coverage using RASUSA.
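The idea of subsampling reads to a specified coverage can be sketched in a few lines of Python. This is not how the RASUSA tool is implemented; it is a simplified illustration under stated assumptions (reads are represented only by their lengths, and the function name, genome size and coverage values are hypothetical).

```python
import random
from typing import Optional


def subsample_reads(read_lengths: list[int], genome_size: int, coverage: float,
                    seed: Optional[int] = None) -> list[int]:
    """Return indices of randomly chosen reads totalling about coverage x genome size.

    Reads are drawn in random order until the target number of bases is
    reached; if the input holds fewer bases than the target, all reads are kept.
    """
    target_bases = genome_size * coverage
    indices = list(range(len(read_lengths)))
    random.Random(seed).shuffle(indices)

    kept, total = [], 0
    for i in indices:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return sorted(kept)


# 1,000 reads of 100 bp against a 1 kb genome is 100x; subsample to 20x.
lengths = [100] * 1000
chosen = subsample_reads(lengths, genome_size=1000, coverage=20, seed=1)
```

Fixing the seed makes the selection reproducible, which is useful when the same subsample has to be regenerated later.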
Other changes
- GitHub Actions for automated testing and continuous integration were added to the PHBG repository!
- The export_taxon_tables task can now handle extra large fastq files.
- Shovill parameters have been exposed so advanced users can select their own assemblers and customize assembly parameters to their heart’s content.
- kSNP3 distance matrices were previously completely unordered. These SNP matrices are now more ordered than they were before. These semi-ordered SNP matrices appear most often when multiple outbreak groups are included in a kSNP3 analysis. Future releases will include the addition of fully-ordered SNP matrices.
- The kSNP3 workflow now produces SNP distance matrices and phylogenetic trees generated using both pangenome and core genome analyses.
Log of PRs
- Fja export tt highmem dev by @frankambrosio3 in #119
- exposing shovill parameters by @sage-wright in #121
- Order SNP-Dists Matrix by @kevinlibuit in #125
- Add GitHub Actions to PHBG by @rpetit3 in #123
- Add RASUSA Task and Workflow File by @kevinlibuit in #122
- add TheiaProk_Illumina_SE workflow by @sage-wright in #124
- typo fix by @frankambrosio3 in #128
- adds ANI to theiaprok_illumina_pe wf by @kapsakcj in #126
- Add BUSCO by @sage-wright in #127
- Adds resfinder to theiaprok pe and se by @michellescribner in #130
- Adds prokka, plasmidfinder, ksnp3 core, wf_pangenome by @michellescribner in #129
- Change default assembler to skesa by @sage-wright in #132
- remove compression from alignment files by @michellescribner in #134
Full Changelog: v0.6.0...v1.0.0
v0.6.0
What's Changed
- Gambit output parsing correction by @sage-wright in #80
- Narrow TBProfiler to MTB only by @kevinlibuit in #91
- Add genotyphi to TheiaProk_Illumina_PE workflow by @kapsakcj in #98
- Remove fasta extension restriction by @kevinlibuit in #105
- Adds legsta to TheiaProk_Illumina_PE by @michellescribner in #106
- Legsta fix SBT output value for samples with no SBT predicted by @michellescribner in #110
- Add disk_size attribute to kSNP3 task by @kevinlibuit in #100
- Enclose Terra billing project and workspace arguments in double quotes by @michellescribner in #109
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's Changed
- Add 2 kraken2 workflows (Single End & Paired End) by @rpetit3 in #70
- New NCBI-AMRFinderPlus workflow and integration in TheiaProk_Illumina_PE wf by @kapsakcj in #65
- TheiaProk_Illumina_PE workflow - replaced abricate task with NCBI-AMRFinderPlus for AMR gene detection
- Fixed integer math in read_screen task by @sage-wright, also in #65
- mlst task updated to 2.22.0 (default docker image updated to staphb/mlst:2.22.0)
- updated gambit_query workflow with updated task (gambit v0.4.0) by @kevinlibuit
- export_taxon_tables feature now includes NCBI-AMRFinderPlus outputs
Full Changelog: v0.4.0...v0.5.0
v0.4.0
This release adds MLST profiling to the TheiaProk_Illumina_PE workflow.
- MLST profiling is performed using @tseemann's mlst workflow
Additional updates to TheiaProk_Illumina_PE:
- Data screening task added to avoid workflow failures caused by low-quality input read data
- QC metrics adjusted for WGS bacterial data
- Capture of n50 from Quast report (Thanks, @erikwolfsohn!)
- Minimum percent length and coverage parameters exposed in Abricate task
- Replacing the Quast assembly length with the Mash estimated genome size for the cg-pipeline read coverage calculations
- Allow for additional fields of metadata to be exported to taxon tables: collection_date, originating_lab, city, county, zip
v0.3.0
This release renames the Apollo_Illumina_PE workflow to TheiaProk_Illumina_PE and restructures the PHBG task directory.
The TheiaProk workflow was developed to replace Apollo workflows for bacterial genomic characterization. TheiaProk is based off of @rpetit3's Bactopia and its Merlin subworkflow and differs from the original Apollo workflows in its organism-typing subworkflow merlin_magic.
This subworkflow triggers organism typing based on gambit taxon assignments for each sample, e.g. serotyping via SeroTypeFinder will be performed for samples with an Escherichia gambit taxon assignment.
TheiaProk organism typing will be performed for the following organisms using the listed bioinformatics software:
- Escherichia spp.: serotypefinder & ectyper
- Listeria spp.: lissero
- Salmonella spp.: sistr & seqsero2
- Klebsiella spp.: kleborate
- Mycobacterium spp.: tbprofiler
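The taxon-triggered typing described above amounts to a lookup from genus to tools. A minimal Python sketch follows; the genus-to-tool mapping is taken from the list above, while the function name and the assumption that the gambit taxon string starts with the genus are hypothetical and do not describe the merlin_magic subworkflow's actual logic.

```python
# Genus-to-tool mapping copied from the list in these notes.
TYPING_TOOLS = {
    "Escherichia": ["serotypefinder", "ectyper"],
    "Listeria": ["lissero"],
    "Salmonella": ["sistr", "seqsero2"],
    "Klebsiella": ["kleborate"],
    "Mycobacterium": ["tbprofiler"],
}


def tools_for_taxon(gambit_taxon: str) -> list[str]:
    """Return the typing tools to run for a taxon assignment.

    Assumption: the taxon string begins with the genus, e.g. "Escherichia coli".
    Taxa without a dedicated module get an empty list.
    """
    parts = gambit_taxon.split()
    genus = parts[0] if parts else ""
    return TYPING_TOOLS.get(genus, [])


ecoli_tools = tools_for_taxon("Escherichia coli")
other_tools = tools_for_taxon("Bacillus subtilis")
```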
TheiaProk_Illumina_PE will also perform AMR gene detection using abricate against the NCBI AMRFinderPlus database.
Additionally, the PHBG directory structure was reformatted for ease of use and readability.
v0.2
Release to add the Kleborate and SerotypeFinder workflows
- Available as tasks within the task_taxon_id.wdl file as well as stand-alone single-task workflows; both available on Terra via Dockstore
Other Changes:
- Version and analysis date captured for every workflow
- Shovill task modified to include optional minimum contig length (default set to 200bp); this default setting is utilized in the Apollo_Illumina_PE workflow
- White space inconsistencies addressed
- Apollo_Illumina_PE output name changes:
  - predicted_genus → gambit_genus
  - predicted_species → gambit_species
- Validation files directory created for local testing