Commit

Merge branch 'main' into smw-ont-read-trim-dev
sage-wright authored Jan 2, 2025
2 parents 613a712 + 604cdf2 commit c687c6e
Showing 18 changed files with 214 additions and 82 deletions.
6 changes: 6 additions & 0 deletions docs/assets/files/TheiaCoV_Illumina_PE_qc_check_template.tsv
@@ -0,0 +1,6 @@
taxon num_reads_raw1 num_reads_raw2 num_reads_clean1 num_reads_clean2 kraken_human kraken_human_dehosted meanbaseq_trim assembly_mean_coverage number_N number_Degenerate assembly_length_unambiguous_min assembly_length_unambiguous_max percent_reference_coverage vadr_num_alerts
sars-cov-2 100000 100000 100000 100000 20 20 30 100 5000 1 25000 30000 83 0
HIV 100000 100000 100000 100000 20 20 30 100
WNV 100000 100000 100000 100000 20 20 30 100
MPXV 100000 100000 100000 100000 20 20 30 100
flu 100000 100000 100000 100000 20 20 30 100
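As a rough illustration of how a table shaped like this template could be read, the sketch below parses tab-separated text into per-taxon thresholds and skips cells that are left blank (as in the HIV, WNV, MPXV, and flu rows). The function name `load_thresholds` and the three-column subset are assumptions made for brevity; this is not the parser used by the `qc_check` task.

```python
import csv
import io


def load_thresholds(tsv_text):
    """Parse a qc_check-style TSV into {taxon: {metric: threshold_string}}.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    table = {}
    for row in reader:
        taxon = row["taxon"]
        # Short rows yield None for missing columns; drop those and empty cells.
        table[taxon] = {
            metric: value
            for metric, value in row.items()
            if metric not in (None, "taxon") and value not in (None, "")
        }
    return table


# A reduced example using a subset of the template's columns and values.
example = (
    "taxon\tnum_reads_raw1\tmeanbaseq_trim\tvadr_num_alerts\n"
    "sars-cov-2\t100000\t30\t0\n"
    "HIV\t100000\t30\n"
)
thresholds = load_thresholds(example)
```

Thresholds are kept as strings here so that the caller can decide how each metric should be typed.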
2 changes: 2 additions & 0 deletions docs/stylesheets/extra.css
@@ -173,6 +173,7 @@
table {
overflow-y: scroll;
max-height: 500px;
max-width: 100vw;
display: block;
}
th {
@@ -183,6 +184,7 @@ th {
}
td {
word-break: break-all;
overflow-wrap: anywhere;
}
/* Base styles for the search box */
div.searchable-table input.table-search-input {
42 changes: 33 additions & 9 deletions docs/workflows/genomic_characterization/theiacov.md
@@ -55,15 +55,15 @@ Additionally, the **TheiaCoV_FASTA_Batch** workflow is available to process seve

### Supported Organisms

These workflows currently support the following organisms:
These workflows currently support the following organisms. The first option in the list (bolded) is what our workflows use as the _standardized_ organism name:

- **SARS-CoV-2** (`"sars-cov-2"`, `"SARS-CoV-2"`) - ==_default organism input_==
- **Monkeypox virus** (`"MPXV"`, `"mpox"`, `"monkeypox"`, `"Monkeypox virus"`, `"Mpox"`)
- **Human Immunodeficiency Virus** (`"HIV"`)
- **West Nile Virus** (`"WNV"`, `"wnv"`, `"West Nile virus"`)
- **Influenza** (`"flu"`, `"influenza"`, `"Flu"`, `"Influenza"`)
- **RSV-A** (`"rsv_a"`, `"rsv-a"`, `"RSV-A"`, `"RSV_A"`)
- **RSV-B** (`"rsv_b"`, `"rsv-b"`, `"RSV-B"`, `"RSV_B"`)
- **SARS-CoV-2** (**`"sars-cov-2"`**, `"SARS-CoV-2"`) - ==_default organism input_==
- **Monkeypox virus** (**`"MPXV"`**, `"mpox"`, `"monkeypox"`, `"Monkeypox virus"`, `"Mpox"`)
- **Human Immunodeficiency Virus** (**`"HIV"`**)
- **West Nile Virus** (**`"WNV"`**, `"wnv"`, `"West Nile virus"`)
- **Influenza** (**`"flu"`**, `"influenza"`, `"Flu"`, `"Influenza"`)
- **RSV-A** (**`"rsv_a"`**, `"rsv-a"`, `"RSV-A"`, `"RSV_A"`)
- **RSV-B** (**`"rsv_b"`**, `"rsv-b"`, `"RSV-B"`, `"RSV_B"`)
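The relationship between accepted inputs and the standardized name can be pictured as a simple lookup. The following is a minimal sketch built only from the aliases listed above; the names `ALIASES`, `LOOKUP`, and `standardize_organism` are hypothetical, and returning `None` for an unrecognized input is an assumption rather than documented workflow behavior.

```python
# Standardized organism name -> accepted input spellings (taken from the list above).
ALIASES = {
    "sars-cov-2": ["sars-cov-2", "SARS-CoV-2"],
    "MPXV": ["MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox"],
    "HIV": ["HIV"],
    "WNV": ["WNV", "wnv", "West Nile virus"],
    "flu": ["flu", "influenza", "Flu", "Influenza"],
    "rsv_a": ["rsv_a", "rsv-a", "RSV-A", "RSV_A"],
    "rsv_b": ["rsv_b", "rsv-b", "RSV-B", "RSV_B"],
}

# Invert the mapping so each accepted spelling points at its standardized name.
LOOKUP = {alias: std for std, aliases in ALIASES.items() for alias in aliases}


def standardize_organism(name):
    """Return the standardized name for an accepted input, or None if unrecognized."""
    return LOOKUP.get(name)
```

For example, `standardize_organism("Mpox")` gives `"MPXV"`, which is the value a `taxon` entry would need to use under this scheme.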

The compatibility of each workflow with each pathogen is shown below:

@@ -170,7 +170,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| flu_track | **genoflu_cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional | FASTA, ONT, PE | flu |
| flu_track | **genoflu_cross_reference** | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | | Optional | FASTA, ONT, PE | |
| flu_track | **genoflu_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional | FASTA, ONT, PE | |
| flu_track | **genoflu_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.03 | Optional | FASTA, ONT, PE | |
| flu_track | **genoflu_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.05 | Optional | FASTA, ONT, PE | |
| flu_track | **genoflu_memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional | FASTA, ONT, PE | |
| flu_track | **irma_cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT, PE | flu |
| flu_track | **irma_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | ONT, PE | flu |
@@ -837,6 +837,30 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| Software Documentation | [NCBI Scrub](<https://github.com/ncbi/sra-human-scrubber/blob/master/README.md>)<br>[Artic pipeline](https://artic.readthedocs.io/en/latest/?badge=latest)<br>[Kraken2](https://github.com/DerrickWood/kraken2/wiki) |
| Original Publication(s) | [STAT: a fast, scalable, MinHash-based *k*-mer tool to assess Sequence Read Archive next-generation sequence submissions](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0)<br>[Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |

??? task "`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)"

The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` TSV file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.

The `qc_check` task applies quality thresholds according to the specified organism, which should match the _standardized_ `organism` input in the TheiaCoV workflows.

??? toggle "Formatting the _qc_check_table.tsv_"

- The first column of the qc_check_table lists the `organism` that the task will assess and the header of this column must be "**taxon**".
- Each subsequent column indicates a QC metric and lists a threshold for each organism that will be checked. **The column names must exactly match expected values, so we highly recommend copying and pasting the header from the template file below as a starting place.**

??? toggle "Template _qc_check_table.tsv_ files"
- TheiaCoV_Illumina_PE: [TheiaCoV_Illumina_PE_qc_check_template.tsv](../../assets/files/TheiaCoV_Illumina_PE_qc_check_template.tsv)

!!! warning "Example Purposes Only"
The QC threshold values shown in the file above are for example purposes only and should not be presumed to be sufficient for every dataset.

!!! techdetails "`qc_check` Technical Details"

| | Links |
| --- | --- |
| Task | [task_qc_check.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/comparisons/task_qc_check.wdl) |
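
The pass/alert/NA behavior described for `qc_check` can be sketched as follows. This is an illustrative approximation, not the logic in `task_qc_check.wdl`: the function name, the `("min" | "max", limit)` threshold encoding, the choice of direction for each metric, and the wording of the alert string are all assumptions.

```python
def qc_check(metrics, thresholds):
    """Compare observed metrics against thresholds and return a status string.

    thresholds maps metric name -> (direction, limit), where direction is
    "min" (value must be >= limit) or "max" (value must be <= limit).
    Hypothetical sketch; the real task may differ.
    """
    if not thresholds:
        # Nothing to check against, so the check cannot proceed.
        return "QC_NA"
    failures = []
    for metric, (direction, limit) in thresholds.items():
        value = metrics.get(metric)
        if value is None:
            failures.append(f"{metric} not available")
        elif direction == "min" and value < limit:
            failures.append(f"{metric} ({value}) below minimum of {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{metric} ({value}) above maximum of {limit}")
    if failures:
        return "QC_ALERT: " + "; ".join(failures)
    return "QC_PASS"


# Directions below are assumed for illustration; limits come from the example template.
example_thresholds = {
    "assembly_mean_coverage": ("min", 100),
    "number_N": ("max", 5000),
}
passing = qc_check({"assembly_mean_coverage": 150, "number_N": 100}, example_thresholds)
failing = qc_check({"assembly_mean_coverage": 50, "number_N": 100}, example_thresholds)
```

Collecting every failure before returning, rather than stopping at the first, keeps the alert string informative when several metrics miss their thresholds.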

#### Assembly tasks

!!! tip ""
96 changes: 66 additions & 30 deletions docs/workflows/genomic_characterization/theiaeuk.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB v2.3.0 | Yes | Sample-level |
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB vX.X.X | Yes | Sample-level |

## TheiaEuk Workflows

@@ -407,7 +407,7 @@ All input reads are processed through "core tasks" in the TheiaEuk workflows. Th
| Software Documentation | https://busco.ezlab.org/ |
| Original publication | [BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs](https://academic.oup.com/bioinformatics/article/31/19/3210/211866) |

??? task "`QC_check`: Check QC Metrics Against User-Defined Thresholds (optional)"
??? task "`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)"

The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` .tsv file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.

@@ -598,64 +598,100 @@ All input reads are processed through "core tasks" in the TheiaEuk workflows. Th

| **Variable** | **Type** | **Description** |
|---|---|---|
| assembly_fasta | File | _De novo_ genome assembly in FASTA format |
| assembly_length | Int | Length of assembly (total number of nucleotides) as determined by QUAST |
| bbduk_docker| String | BBDuk docker image used |
| busco_database | String | BUSCO database used |
| busco_docker | String | BUSCO docker image used |
| busco_report | File | A plain text summary of the results in BUSCO notation |
| busco_results | String | BUSCO results (see above for explanation of BUSCO notation) |
| busco_version | String | BUSCO software version used |
| cg_pipeline_docker | String | Docker file used for running CG-Pipeline on cleaned reads |
| cg_pipeline_report | File | TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage |
| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
| cladetyper_annotated_reference | String | The annotated reference file for the identified clade, "None" if no clade was identified |
| cladetyper_clade | String | The clade assigned to the input assembly |
| cladetyper_docker_image | String | The Docker container used for the task |
| cladetyper_gambit_version | String | The version of GAMBIT used for the analysis |
| combined_mean_q_clean | Float | Mean quality score for the combined clean reads |
| combined_mean_q_raw | Float | Mean quality score for the combined raw reads |
| combined_mean_readlength_clean | Float | Mean read length for the combined clean reads |
| combined_mean_readlength_raw | Float | Mean read length for the combined raw reads |
| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
| contigs_gfa | File | Assembly graph if spades used for genome assembly |
| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | Version of fastp software used |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length |
| fastq_scan_num_reads_clean_pairs | String | Number of read pairs after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean1 | Int | Number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | Number of input forward reads calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | Number of input reverse reads calculated by fastq_scan |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length |
| r1_mean_q_clean | Float | Mean quality score of clean forward reads |
| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| r2_mean_q_clean | Float | Mean quality score of clean reverse reads |
| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
| fastq_scan_version | String | Version of fastq-scan software used |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
| fastqc_docker | String | Docker container used with fastqc |
| fastqc_num_reads_clean1 | Int | Number of forward reads after cleaning by fastqc |
| fastqc_num_reads_clean2 | Int | Number of reverse reads after cleaning by fastqc |
| fastqc_num_reads_clean_pairs | String | Number of read pairs after cleaning by fastqc |
| fastqc_num_reads_raw1 | Int | Number of input forward reads by fastqc |
| fastqc_num_reads_raw2 | Int | Number of input reverse reads by fastqc |
| fastqc_num_reads_raw_pairs | String | Number of input read pairs by fastqc |
| fastqc_raw1_html | File | Graphical visualization of raw forward read quality from fastqc to open in an internet browser |
| fastqc_raw2_html | File | Graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
| fastqc_version | String | Version of fastqc software used |
| gambit_closest_genomes | File | CSV file listing genomes in the GAMBIT database that are most similar to the query assembly |
| gambit_db_version | String | Version of GAMBIT used |
| gambit_docker | String | GAMBIT docker file used |
| gambit_predicted_taxon | String | Taxon predicted by GAMBIT |
| gambit_predicted_taxon_rank | String | Taxon rank of GAMBIT taxon prediction |
| gambit_report | File | GAMBIT report in a machine-readable format |
| gambit_version | String | Version of GAMBIT software used |
| assembly_length | Int | Length of assembly (total contig length) as determined by QUAST |
| n50_value | Int | N50 of assembly calculated by QUAST |
| number_contigs | Int | Total number of contigs in assembly |
| qc_check | String | A string that indicates whether or not the sample passes a set of pre-determined and user-provided QC thresholds |
| qc_standard | File | The user-provided file that contains the QC thresholds used for the QC check |
| quast_gc_percent | Float | The GC percent of your sample |
| quast_report | File | TSV report from QUAST |
| quast_version | String | Software version of QUAST used |
| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
| r2_mean_readlength_clean | Float | Mean read length of clean reverse reads |
| rasusa_version | String | Version of rasusa used |
| read1_subsampled | File | Subsampled read1 file |
| read2_subsampled | File | Subsampled read2 file |
| bbduk_docker | String | BBDuk docker image used |
| fastp_version | String | Version of fastp software used |
| read1_clean | File | Clean forward reads file |
| read1_subsampled | File | Subsampled read1 file |
| read2_clean | File | Clean reverse reads file |
| num_reads_clean_pairs | String | Number of read pairs after cleaning |
| num_reads_clean1 | Int | Number of forward reads after cleaning |
| num_reads_clean2 | Int | Number of reverse reads after cleaning |
| num_reads_raw_pairs | String | Number of input read pairs |
| num_reads_raw1 | Int | Number of input forward reads |
| num_reads_raw2 | Int | Number of input reverse reads |
| trimmomatic_version | String | Version of trimmomatic used |
| clean_read_screen | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure |
| raw_read_screen | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason for failure |
| assembly_fasta | File | <https://github.com/tseemann/shovill#contigsfa> |
| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
| contigs_gfa | File | Assembly graph if spades used for genome assembly |
| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
| read2_subsampled | File | Subsampled read2 file |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure | ONT, PE, SE |
| read_screen_raw | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason for failure |
| seq_platform | String | Sequencing platform input by the user |
| shovill_pe_version | String | Shovill version used |
| theiaeuk_snippy_variants_bam | File | BAM file produced by the snippy module |
| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaEuk PE workflow execution |
| theiaeuk_illumina_pe_version | String | TheiaEuk PE workflow version used |
| theiaeuk_snippy_variants_bai | String | BAI file produced by the snippy module |
| theiaeuk_snippy_variants_bam | String | BAM file produced by the snippy module |
| theiaeuk_snippy_variants_coverage_tsv | String | TSV file containing coverage information for each base in the reference genome |
| theiaeuk_snippy_variants_gene_query_results | File | File containing all lines from variants file matching gene query terms |
| theiaeuk_snippy_variants_hits | String | String of all variant file entries matching gene query term |
| theiaeuk_snippy_variants_num_reads_aligned | String | Number of reads aligned by snippy |
| theiaeuk_snippy_variants_num_variants | Int | Number of variants detected by snippy |
| theiaeuk_snippy_variants_outdir_tarball | File | Tar compressed file containing full snippy output directory |
| theiaeuk_snippy_variants_percent_ref_coverage | String | Percent of reference genome covered by snippy |
| theiaeuk_snippy_variants_query | String | The gene query term(s) used to search variant |
| theiaeuk_snippy_variants_query_check | String | Whether the gene query terms were present in the annotated reference genome file |
| theiaeuk_snippy_variants_reference_genome | File | The reference genome used in the alignment and variant calling |
| theiaeuk_snippy_variants_results | File | The variants file produced by snippy |
| theiaeuk_snippy_variants_summary | File | A file summarizing the variants detected by snippy |
| theiaeuk_snippy_variants_version | String | The version of the snippy_variants module being used |
| seq_platform | String | Sequencing platform inout by the user |
| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaProk workflow execution |
| theiaeuk_illumina_pe_version | String | TheiaProk workflow version used |
| trimmomatic_docker | String | Docker image used for trimmomatic |
| trimmomatic_version | String | Version of trimmomatic used |

</div>
2 changes: 1 addition & 1 deletion docs/workflows/genomic_characterization/theiaprok.md
@@ -1113,7 +1113,7 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
| Software Documentation | https://bitbucket.org/genomicepidemiology/plasmidfinder/src/master/ |
| Original Publication(s) | [In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4068535/) |

??? task "**`QC_check`: Check QC Metrics Against User-Defined Thresholds (optional)**"
??? task "**`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)**"

The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` .tsv file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.

2 changes: 1 addition & 1 deletion docs/workflows/phylogenetic_construction/augur.md
@@ -174,7 +174,7 @@ The Augur_PHB workflow takes in a ***set*** of SARS-CoV-2 (or any other viral

This workflow runs on the set level. Please note that for every task, runtime parameters are modifiable (cpu, disk_size, docker, and memory); most of these values have been excluded from the table below for convenience.

<div class="searchable-table" markdown="1" width=100vw>
<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|