Merge branch 'main' into smw-ont-read-trim-dev

theiagen · Jan 2, 2025 · c687c6e
2 parents 613a712 + 604cdf2
commit c687c6e

Showing 18 changed files with 214 additions and 82 deletions.
diff --git a/docs/assets/files/TheiaCoV_Illumina_PE_qc_check_template.tsv b/docs/assets/files/TheiaCoV_Illumina_PE_qc_check_template.tsv
@@ -0,0 +1,6 @@
+taxon	num_reads_raw1	num_reads_raw2	num_reads_clean1	num_reads_clean2	kraken_human	kraken_human_dehosted	meanbaseq_trim	assembly_mean_coverage	number_N	number_Degenerate	assembly_length_unambiguous_min	assembly_length_unambiguous_max	percent_reference_coverage	vadr_num_alerts
+sars-cov-2	100000	100000	100000	100000	20	20	30	100	5000	1	25000	30000	83	0
+HIV	100000	100000	100000	100000	20	20	30	100						
+WNV	100000	100000	100000	100000	20	20	30	100						
+MPXV	100000	100000	100000	100000	20	20	30	100						
+flu	100000	100000	100000	100000	20	20	30	100						
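
Note on the template above: a minimal sketch of how a table in this layout could be read, assuming only what the docs in this commit state (the first column header must be `taxon`, and each later column holds one threshold per organism). The helper name `load_thresholds` is hypothetical and is not taken from `task_qc_check.wdl`. The sketch does not guess how each threshold is compared against a metric; it only collects the non-empty cells for one taxon.

```python
import csv
import io

# Header and two rows copied from the template file added in this commit.
HEADER = [
    "taxon", "num_reads_raw1", "num_reads_raw2", "num_reads_clean1",
    "num_reads_clean2", "kraken_human", "kraken_human_dehosted",
    "meanbaseq_trim", "assembly_mean_coverage", "number_N",
    "number_Degenerate", "assembly_length_unambiguous_min",
    "assembly_length_unambiguous_max", "percent_reference_coverage",
    "vadr_num_alerts",
]
ROWS = [
    ["sars-cov-2", "100000", "100000", "100000", "100000", "20", "20",
     "30", "100", "5000", "1", "25000", "30000", "83", "0"],
    ["HIV", "100000", "100000", "100000", "100000", "20", "20",
     "30", "100", "", "", "", "", "", ""],
]
TEMPLATE_TSV = "\n".join("\t".join(r) for r in [HEADER, *ROWS]) + "\n"


def load_thresholds(tsv_text, taxon):
    """Return {metric: threshold} for one taxon, skipping empty cells.

    Hypothetical helper: matching on the taxon column is exact, so the
    caller is assumed to pass the standardized organism name.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if not reader.fieldnames or reader.fieldnames[0] != "taxon":
        raise ValueError('the first column header must be "taxon"')
    for row in reader:
        if row["taxon"] == taxon:
            return {
                metric: float(value)
                for metric, value in row.items()
                if metric != "taxon" and value not in (None, "")
            }
    return None  # taxon not listed in the table


sars = load_thresholds(TEMPLATE_TSV, "sars-cov-2")
hiv = load_thresholds(TEMPLATE_TSV, "HIV")
```

Empty cells, as in the HIV row, are dropped rather than treated as zero, so a metric with no threshold is simply absent from the result.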
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
@@ -173,6 +173,7 @@
 table {
   overflow-y: scroll;
   max-height: 500px;
+  max-width: 100vw;
   display: block;
 }
 th {
@@ -183,6 +184,7 @@ th {
 }
 td {
   word-break: break-all;
+  overflow-wrap: anywhere;
 }
 /* Base styles for the search box */
 div.searchable-table input.table-search-input {

diff --git a/docs/workflows/genomic_characterization/theiacov.md b/docs/workflows/genomic_characterization/theiacov.md
@@ -55,15 +55,15 @@ Additionally, the **TheiaCoV_FASTA_Batch** workflow is available to process seve
 
 ### Supported Organisms
 
-These workflows currently support the following organisms:
+These workflows currently support the following organisms. The first option in the list (bolded) is what our workflows use as the _standardized_ organism name:
 
-- **SARS-CoV-2** (`"sars-cov-2"`, `"SARS-CoV-2"`) - ==_default organism input_==
-- **Monkeypox virus** (`"MPXV"`, `"mpox"`, `"monkeypox"`, `"Monkeypox virus"`, `"Mpox"`)
-- **Human Immunodeficiency Virus** (`"HIV"`)
-- **West Nile Virus** (`"WNV"`, `"wnv"`, `"West Nile virus"`)
-- **Influenza** (`"flu"`, `"influenza"`, `"Flu"`, `"Influenza"`)
-- **RSV-A** (`"rsv_a"`, `"rsv-a"`, `"RSV-A"`, `"RSV_A"`)
-- **RSV-B** (`"rsv_b"`, `"rsv-b"`, `"RSV-B"`, `"RSV_B"`)
+- **SARS-CoV-2** (**`"sars-cov-2"`**, `"SARS-CoV-2"`) - ==_default organism input_==
+- **Monkeypox virus** (**`"MPXV"`**, `"mpox"`, `"monkeypox"`, `"Monkeypox virus"`, `"Mpox"`)
+- **Human Immunodeficiency Virus** (**`"HIV"`**)
+- **West Nile Virus** (**`"WNV"`**, `"wnv"`, `"West Nile virus"`)
+- **Influenza** (**`"flu"`**, `"influenza"`, `"Flu"`, `"Influenza"`)
+- **RSV-A** (**`"rsv_a"`**, `"rsv-a"`, `"RSV-A"`, `"RSV_A"`)
+- **RSV-B** (**`"rsv_b"`**, `"rsv-b"`, `"RSV-B"`, `"RSV_B"`)
 
 The compatibility of each workflow with each pathogen is shown below:
 
@@ -170,7 +170,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
 | flu_track | **genoflu_cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional | FASTA, ONT, PE | flu |
 | flu_track | **genoflu_cross_reference** | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | | Optional | FASTA, ONT, PE | |
 | flu_track | **genoflu_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional | FASTA, ONT, PE | |
-| flu_track | **genoflu_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.03 | Optional | FASTA, ONT, PE | |
+| flu_track | **genoflu_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.05 | Optional | FASTA, ONT, PE | |
 | flu_track | **genoflu_memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional | FASTA, ONT, PE | |
 | flu_track | **irma_cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT, PE | flu |
 | flu_track | **irma_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | ONT, PE | flu |
@@ -837,6 +837,30 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
         | Software Documentation | [NCBI Scrub](<https://github.com/ncbi/sra-human-scrubber/blob/master/README.md>)<br>[Artic pipeline](https://artic.readthedocs.io/en/latest/?badge=latest)<br>[Kraken2](https://github.com/DerrickWood/kraken2/wiki) |
         | Original Publication(s) | [STAT: a fast, scalable, MinHash-based *k*-mer tool to assess Sequence Read Archive next-generation sequence submissions](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0)<br>[Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)  |
 
+??? task "`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)"
+
+    The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` TSV file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.
+
+    The `qc_check` task applies quality thresholds according to the specified organism, which should match the _standardized_ `organism` input in the TheiaCoV workflows.
+
+    ??? toggle "Formatting the _qc_check_table.tsv_"
+
+        - The first column of the qc_check_table lists the `organism` that the task will assess and the header of this column must be "**taxon**".
+        - Each subsequent column indicates a QC metric and lists a threshold for each organism that will be checked. **The column names must exactly match expected values, so we highly recommend copying and pasting the header from the template file below as a starting point.**
+
+    ??? toggle "Template _qc_check_table.tsv_ files"    
+        
+        - TheiaCoV_Illumina_PE: [TheiaCoV_Illumina_PE_qc_check_template.tsv](../../assets/files/TheiaCoV_Illumina_PE_qc_check_template.tsv)
+
+        !!! warning "Example Purposes Only"
+            The QC threshold values shown in the file above are for example purposes only and should not be presumed to be sufficient for every dataset.
+
+    !!! techdetails "`qc_check` Technical Details"
+
+        |  | Links |
+        | --- | --- |
+        | Task | [task_qc_check.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/comparisons/task_qc_check.wdl) |
+
 #### Assembly tasks
 
 !!! tip ""

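Note on the organism names in the theiacov.md changes above: a minimal sketch of resolving an accepted `organism` value to its standardized (bolded) name, for example before looking up a row in a `qc_check_table`. The alias lists are copied from the updated "Supported Organisms" section; the lookup table and the function `standardize_organism` are hypothetical and do not reflect how the workflows implement this.

```python
# Standardized name -> other accepted spellings, as listed in the docs.
ALIASES = {
    "sars-cov-2": ["SARS-CoV-2"],
    "MPXV": ["mpox", "monkeypox", "Monkeypox virus", "Mpox"],
    "HIV": [],
    "WNV": ["wnv", "West Nile virus"],
    "flu": ["influenza", "Flu", "Influenza"],
    "rsv_a": ["rsv-a", "RSV-A", "RSV_A"],
    "rsv_b": ["rsv-b", "RSV-B", "RSV_B"],
}

# Flatten into a single lookup: every accepted spelling maps to its
# standardized name, and each standardized name maps to itself.
LOOKUP = {
    name: standard
    for standard, alternates in ALIASES.items()
    for name in [standard, *alternates]
}


def standardize_organism(name):
    """Return the standardized organism name, or None if unrecognized."""
    return LOOKUP.get(name)
```

For instance, `standardize_organism("Mpox")` yields `"MPXV"`, which is the value the `taxon` column of the template uses.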
diff --git a/docs/workflows/genomic_characterization/theiaeuk.md b/docs/workflows/genomic_characterization/theiaeuk.md
@@ -4,7 +4,7 @@
 
 | **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
 |---|---|---|---|---|
-| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB v2.3.0 | Yes | Sample-level |
+| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB vX.X.X | Yes | Sample-level |
 
 ## TheiaEuk Workflows
 
@@ -407,7 +407,7 @@ All input reads are processed through "core tasks" in the TheiaEuk workflows. Th
         | Software Documentation | https://busco.ezlab.org/ |
         | Original publication | [BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs](https://academic.oup.com/bioinformatics/article/31/19/3210/211866) |
 
-??? task "`QC_check`: Check QC Metrics Against User-Defined Thresholds (optional)"
+??? task "`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)"
 
     The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` .tsv file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.
 
@@ -598,64 +598,100 @@ All input reads are processed through "core tasks" in the TheiaEuk workflows. Th
 
 | **Variable** | **Type** | **Description** |
 |---|---|---|
+| assembly_fasta | File | _De novo_ genome assembly in FASTA format |
+| assembly_length | Int | Length of assembly (total number of nucleotides) as determined by QUAST |
+| bbduk_docker | String | BBDuk docker image used |
+| busco_database | String | BUSCO database used |
+| busco_docker | String | BUSCO docker image used |
+| busco_report | File | A plain text summary of the results in BUSCO notation |
+| busco_results | String | BUSCO results (see above for explanation of BUSCO notation) |
+| busco_version | String | BUSCO software version used |
 | cg_pipeline_docker | String | Docker file used for running CG-Pipeline on cleaned reads |
 | cg_pipeline_report | File | TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage |
-| est_coverage_clean | Float | Estimated coverage calculated from   clean reads and genome length |
-| est_coverage_raw | Float | Estimated coverage calculated from  raw reads and genome length |
+| cladetyper_annotated_reference | String | The annotated reference file for the identified clade, "None" if no clade was identified |
+| cladetyper_clade | String | The clade assigned to the input assembly |
+| cladetyper_docker_image | String | The Docker container used for the task |
+| cladetyper_gambit_version | String | The version of GAMBIT used for the analysis |
+| combined_mean_q_clean | Float | Mean quality score for the combined clean reads |
+| combined_mean_q_raw | Float | Mean quality score for the combined raw reads |
+| combined_mean_readlength_clean | Float | Mean read length for the combined clean reads |
+| combined_mean_readlength_raw | Float | Mean read length for the combined raw reads |
+| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
+| contigs_gfa | File | Assembly graph if spades used for genome assembly |
+| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
+| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
+| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
+| fastp_html_report | File | The HTML report made with fastp |
+| fastp_version | String | Version of fastp software used |
 | fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length |
 | fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length |
+| fastq_scan_num_reads_clean_pairs | String | Number of read pairs after cleaning as calculated by fastq_scan |
+| fastq_scan_num_reads_clean1 | Int | Number of forward reads after cleaning as calculated by fastq_scan |
+| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after cleaning as calculated by fastq_scan |
+| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
+| fastq_scan_num_reads_raw1 | Int | Number of input forward reads calculated by fastq_scan |
+| fastq_scan_num_reads_raw2 | Int | Number of input reverse reads calculated by fastq_scan |
+| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
 | fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length |
 | fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length |
-| r1_mean_q_clean | Float | Mean quality score of clean forward reads |
-| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
-| r2_mean_q_clean | Float | Mean quality score of clean reverse reads |
-| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
 | fastq_scan_version | String | Version of fastq-scan software used |
+| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser |
+| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
+| fastqc_docker | String | Docker container used with fastqc |
+| fastqc_num_reads_clean1 | Int | Number of forward reads after cleaning by fastqc |
+| fastqc_num_reads_clean2 | Int | Number of reverse reads after cleaning by fastqc |
+| fastqc_num_reads_clean_pairs | String | Number of read pairs after cleaning by fastqc |
+| fastqc_num_reads_raw1 | Int | Number of input forward reads by fastqc |
+| fastqc_num_reads_raw2 | Int | Number of input reverse reads by fastqc |
+| fastqc_num_reads_raw_pairs | String | Number of input read pairs by fastqc |
+| fastqc_raw1_html | File | Graphical visualization of raw forward read quality from fastqc to open in an internet browser |
+| fastqc_raw2_html | File | Graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
+| fastqc_version | String | Version of fastqc software used |
 | gambit_closest_genomes | File | CSV file listing genomes in the GAMBIT database that are most similar to the query assembly |
 | gambit_db_version | String | Version of GAMBIT used |
 | gambit_docker | String | GAMBIT docker file used |
 | gambit_predicted_taxon | String | Taxon predicted by GAMBIT |
 | gambit_predicted_taxon_rank | String | Taxon rank of GAMBIT taxon prediction |
 | gambit_report | File | GAMBIT report in a machine-readable format |
 | gambit_version | String | Version of GAMBIT software used |
-| assembly_length | Int | Length of assembly (total contig length) as determined by QUAST |
 | n50_value | Int | N50 of assembly calculated by QUAST |
 | number_contigs | Int | Total number of contigs in assembly |
+| qc_check | String | A string that indicates whether or not the sample passes a set of pre-determined and user-provided QC thresholds |
+| qc_standard | File | The user-provided file that contains the QC thresholds used for the QC check |
+| quast_gc_percent | Float | The GC percent of your sample |
 | quast_report | File | TSV report from QUAST |
 | quast_version | String | Software version of QUAST used |
+| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
+| r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
+| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
+| r2_mean_readlength_clean | Float | Mean read length of clean reverse reads |
 | rasusa_version | String | Version of rasusa used |
-| read1_subsampled | File | Subsampled read1 file |
-| read2_subsampled | File | Subsampled read2 file |
-| bbduk_docker | String | BBDuk docker image used  |
-| fastp_version | String | Version of fastp software used |
 | read1_clean | File | Clean forward reads file |
+| read1_subsampled | File | Subsampled read1 file |
 | read2_clean | File | Clean reverse reads file |
-| num_reads_clean_pairs | String | Number of read pairs after cleaning |
-| num_reads_clean1 | Int | Number of forward reads after cleaning |
-| num_reads_clean2 | Int | Number of reverse reads after cleaning |
-| num_reads_raw_pairs | String | Number of input read pairs |
-| num_reads_raw1 | Int | Number of input forward reads |
-| num_reads_raw2 | Int | Number of input reverse reads |
-| trimmomatic_version | String | Version of trimmomatic used |
-| clean_read_screen | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure |
-| raw_read_screen | String | PASS or FAIL result from raw read screening; FAIL accompanied by thereason for failure |
-| assembly_fasta | File | <https://github.com/tseemann/shovill#contigsfa> |
-| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
-| contigs_gfa | File | Assembly graph if spades used for genome assembly |
-| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
+| read2_subsampled | File | Subsampled read2 file |
+| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure |
+| read_screen_raw | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason for failure |
+| seq_platform | String | Sequencing platform input by the user |
 | shovill_pe_version | String | Shovill version used |
-| theiaeuk_snippy_variants_bam | File | BAM file produced by the snippy module |
+| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaEuk PE workflow execution |
+| theiaeuk_illumina_pe_version | String | TheiaEuk PE workflow version used |
+| theiaeuk_snippy_variants_bai | String | BAI file produced by the snippy module |
+| theiaeuk_snippy_variants_bam | String | BAM file produced by the snippy module |
+| theiaeuk_snippy_variants_coverage_tsv | String | TSV file containing coverage information for each base in the reference genome |
 | theiaeuk_snippy_variants_gene_query_results | File | File containing all lines from variants file matching gene query terms |
 | theiaeuk_snippy_variants_hits | String | String of all variant file entries matching gene query term |
+| theiaeuk_snippy_variants_num_reads_aligned | String | Number of reads aligned by snippy |
+| theiaeuk_snippy_variants_num_variants | Int | Number of variants detected by snippy |
 | theiaeuk_snippy_variants_outdir_tarball | File | Tar compressed file containing full snippy output directory |
+| theiaeuk_snippy_variants_percent_ref_coverage | String | Percent of reference genome covered by snippy |
 | theiaeuk_snippy_variants_query | String | The gene query term(s) used to search variant |
 | theiaeuk_snippy_variants_query_check | String | Were the gene query terms present in the reference annotated genome file |
 | theiaeuk_snippy_variants_reference_genome | File | The reference genome used in the alignment and variant calling |
 | theiaeuk_snippy_variants_results | File | The variants file produced by snippy |
 | theiaeuk_snippy_variants_summary | File | A file summarizing the variants detected by snippy |
 | theiaeuk_snippy_variants_version | String | The version of the snippy_variants module being used |
-| seq_platform | String | Sequencing platform inout by the user |
-| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaProk workflow execution |
-| theiaeuk_illumina_pe_version | String | TheiaProk workflow version used |
+| trimmomatic_docker | String | Docker image used for trimmomatic |
+| trimmomatic_version | String | Version of trimmomatic used |
 
 </div>
diff --git a/docs/workflows/genomic_characterization/theiaprok.md b/docs/workflows/genomic_characterization/theiaprok.md
@@ -1113,7 +1113,7 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
         | Software Documentation | https://bitbucket.org/genomicepidemiology/plasmidfinder/src/master/ |
         | Original Publication(s) | [In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4068535/) |
 
-??? task "**`QC_check`: Check QC Metrics Against User-Defined Thresholds (optional)**"
+??? task "**`qc_check`: Check QC Metrics Against User-Defined Thresholds (optional)**"
 
     The `qc_check` task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a `qc_check_table` .tsv file. If all QC metrics meet the threshold, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed or `QC_ALERT` followed by a string indicating what metric failed.
 

diff --git a/docs/workflows/phylogenetic_construction/augur.md b/docs/workflows/phylogenetic_construction/augur.md
@@ -174,7 +174,7 @@ The Augur_PHB workflow takes in a ***set*** of SARS-CoV-2 (or any other viral
 
 This workflow runs on the set level. Please note that for every task, runtime parameters are modifiable (cpu, disk_size, docker, and memory); most of these values have been excluded from the table below for convenience.
 
-<div class="searchable-table" markdown="1" width=100vw>
+<div class="searchable-table" markdown="1"> 
 
 | **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
 |---|---|---|---|---|---|