updated snapshots and test output checks
mattheww95 committed Oct 2, 2024
1 parent 020f327 commit 5da1919
Showing 5 changed files with 311 additions and 50 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Changed`

- Added RASUSA for downsampling of nanopore or pacbio data.
- Added RASUSA for downsampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

### `Updated`

20 changes: 10 additions & 10 deletions docs/usage/tool_params.md
@@ -15,15 +15,15 @@ Screens contigs for antimicrobial and virulence genes. If you wish to use a diff
- singularity: Abricate singularity container
- docker: Abricate docker container
- **args**: Can be a string of additional command line arguments to pass to abricate
- report_tag: determines the name of the Abricate output in the final summary file. **Do no touch this unless doing pipeline development.**
- header_p: This field tells the report module that the Abricate output contains headers. **Do no touch this unless doing pipeline development.**
- report_tag: determines the name of the Abricate output in the final summary file. **Do not change this unless doing pipeline development.**
- header_p: This field tells the report module that the Abricate output contains headers. **Do not change this unless doing pipeline development.**

### Raw Read Metrics
A custom Python script that gathers quality metrics for each fastq file.

- raw_reads
- high_precision: When set to true, floating point precision of values output are accurate down to very small decimal places. Recommended to leave this setting as false (use the standard floats), it is much faster and having such precise decimal places is not needed for this module.
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do no touch this unless doing pipeline development.**
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do not change this unless doing pipeline development.**
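
The `high_precision` trade-off can be illustrated with a small Python sketch (this is not mikrokondo's actual metrics script; the function names are invented). Standard floats already agree with arbitrary-precision decimals far beyond any biologically meaningful number of decimal places, which is why leaving the option off is recommended:

```python
# Sketch: mean Phred quality of a read computed with ordinary floats
# versus decimal.Decimal, illustrating the "high_precision" option.
from decimal import Decimal, getcontext

def mean_quality_float(phred_scores):
    """Mean Phred score using ordinary floats (fast, ~15-17 significant digits)."""
    return sum(phred_scores) / len(phred_scores)

def mean_quality_decimal(phred_scores, digits=50):
    """Mean Phred score using arbitrary-precision decimals (much slower)."""
    getcontext().prec = digits
    return sum(Decimal(q) for q in phred_scores) / Decimal(len(phred_scores))

scores = [30, 31, 28, 37, 40, 22]
fast = mean_quality_float(scores)
precise = mean_quality_decimal(scores)
# The two results agree to well below any meaningful precision.
assert abs(fast - float(precise)) < 1e-9
```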

### Coreutils
In cases where a process uses bash scripting only, Nextflow by default will utilize system binaries when they are available and no container is specified. For reproducibility, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via below commands:
@@ -47,12 +47,12 @@ Kat was previously used to estimate genome size, however at the time of writing
Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo. The usage of seqtk to convert a fasta to a fastq is needed in certain typing tools requiring reads as input (this was a design decision to keep the pipeline generalizable).

- seqtk
- singularity: singularity container for seqtk
- docker: docker container for seqtk
- singularity: Singularity container for seqtk
- docker: Docker container for seqtk
- seed: A seed value for sub-sampling
- reads_ext: Extension of reads after sub-sampling, do not touch alter this unless doing pipeline development.
- assembly_fastq: Extension of the fastas after being converted to fastq files. Do no touch this unless doing pipeline development.
- report_tag: Name of seqtk data in the fi nal summary report. Do no touch this unless doing pipeline development.
- assembly_fastq: Extension of the fastas after being converted to fastq files. Do not change this unless doing pipeline development.
- report_tag: Name of seqtk data in the final summary report. Do not change this unless doing pipeline development.
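
The role of the `seed` parameter can be sketched in Python (an analogy to seeded sampling as in `seqtk sample -s`, not seqtk's implementation): a fixed seed makes the selected subset deterministic, so repeated runs reproduce the same reads.

```python
# Sketch of seeded down-sampling: the same seed always selects the
# same subset of reads, making sub-sampling reproducible.
import random

def subsample(read_ids, fraction, seed):
    """Return a deterministic pseudo-random subset of reads."""
    rng = random.Random(seed)
    return [r for r in read_ids if rng.random() < fraction]

reads = [f"read_{i}" for i in range(1000)]
a = subsample(reads, 0.1, seed=42)
b = subsample(reads, 0.1, seed=42)
assert a == b             # identical seed -> identical subset
assert 50 < len(a) < 150  # roughly 10% of reads retained
```

For paired-end data this determinism is what keeps R1 and R2 files in sync: sampling both with the same seed selects the same read pairs.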

### Rasusa
For long read data Rasusa is used for down sampling as it take read length into consideration when down sampling.
@@ -61,16 +61,16 @@
- singularity: singularity container for rasusa
- docker: docker container for rasusa
- seed: A seed value for sub-sampling
- reads_ext: The extension of the generated fastq files. Do no touch this unless doing pipeline development.
- reads_ext: The extension of the generated fastq files. Do not change this unless doing pipeline development.
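
The "takes read length into consideration" point can be sketched as follows (a simplified, non-random illustration, not Rasusa's actual algorithm, which selects reads randomly): reads are kept until their cumulative length reaches `target_depth * genome_size`, so coverage depth, not read count, drives the cut-off.

```python
# Sketch of length-aware down-sampling: stop keeping reads once the
# kept bases reach target_depth * genome_size.
def downsample_to_depth(read_lengths, genome_size, target_depth):
    target_bases = genome_size * target_depth
    kept, total = [], 0
    for i, length in enumerate(read_lengths):
        if total >= target_bases:
            break
        kept.append(i)
        total += length
    return kept

# 1 kb "genome", 10x target depth -> stop once ~10,000 bases are kept.
lengths = [1500, 3000, 2500, 4000, 2000, 1000]
kept = downsample_to_depth(lengths, genome_size=1000, target_depth=10)
assert len(kept) == 4  # 1500 + 3000 + 2500 + 4000 = 11000 >= 10000
```

This matters for long reads because a fixed read-count cut-off (as in simple sub-sampling) would give wildly different depths depending on read length.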

### FastP
Fastp is fast and widely used program for gathering of read quality metrics, adapter trimming, read filtering and read trimming. FastP has extensive options for configuration which are detailed in their documentation, but sensible defaults have been set. **Adapter trimming in Fastp is performed using overlap analysis, however if you do not trust this you can specify the sequencing adapters used directly in the additional arguments for Fastp**.

- fastp
- singularity: singularity container for FastP
- docker: docker container for FastP
- fastq_ext: extension of the output Fastp trimmed reads, do not touch this unless doing pipeline development.
- html_ext: Extension of the html report output by fastp, do no touch unless doing pipeline development.
- fastq_ext: extension of the output Fastp trimmed reads, Do not change this unless doing pipeline development.
- html_ext: Extension of the html report output by fastp, Do not touch unless doing pipeline development.
- json_ext: Extension of json report output by FastP do not touch unless doing pipeline development.
- report_tag: Title of FastP data in the summary report.
- **average_quality_e**: If a read/read-pair quality is less than this value it is discarded. Can be set from the command line with `--fp_average_quality`.
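
The average-quality filter described above can be sketched in Python (an illustration of the idea; FastP's exact computation may differ): decode the Phred+33 quality characters, average them, and discard the read if the mean falls below the threshold.

```python
# Sketch of a mean-quality read filter over Phred+33 quality strings.
def mean_phred(quality_string):
    """Arithmetic mean of Phred scores encoded as ASCII (offset 33)."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

def passes_filter(quality_string, min_average=25):
    """Keep the read only if its mean quality meets the threshold."""
    return mean_phred(quality_string) >= min_average

assert passes_filter("IIIIIIII")      # 'I' decodes to Q40
assert not passes_filter("$$$$$$$$")  # '$' decodes to Q3
```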
7 changes: 3 additions & 4 deletions subworkflows/local/clean_reads.nf
@@ -135,9 +135,9 @@ workflow QC_READS {
log.info "Not down sampling ${it[0].id} as estimated sample depth is already below targeted depth of ${params.target_depth}."
}

to_down_sample = reads_sample.sub_sample.branch { meta, reads, sample_frac ->
short_reads: !meta.single_end // Hybrid and short reads sets still go to seqtk
long_reads: meta.single_end
to_down_sample = reads_sample.sub_sample.branch { it ->
short_reads: !it[0].single_end
long_reads: true
}

// Short reads and hybrid reads sets get sampled with seqtk still.
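
The refactored `branch` above relies on Nextflow evaluating branch conditions in order and routing each item to the first condition that matches, so `long_reads: true` acts as a catch-all for anything that is not a paired short-read set. A Python analog of that routing (illustrative only, not the workflow's code):

```python
# Python analog of an ordered two-way branch: first matching condition
# wins, and a final catch-all takes everything else.
def branch_reads(items):
    short_reads, long_reads = [], []
    for item in items:
        meta = item[0]
        if not meta["single_end"]:    # paired short/hybrid read sets
            short_reads.append(item)
        else:                         # catch-all, like `long_reads: true`
            long_reads.append(item)
    return short_reads, long_reads

items = [({"single_end": False}, "R1+R2"), ({"single_end": True}, "nanopore")]
short, long_ = branch_reads(items)
assert len(short) == 1 and len(long_) == 1
```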
@@ -178,7 +178,6 @@ workflow QC_READS {
}

mash_screen_out = MASH_SCREEN(ch_prepped_reads, params.mash.mash_sketch ? file(params.mash.mash_sketch) : error("--mash_sketch ${params.mash_sketch} is invalid"))

versions = versions.mix(mash_screen_out.versions)

// Determine if sample is metagenomic
90 changes: 75 additions & 15 deletions tests/subworkflows/local/clean_reads/clean_reads.nf.test
@@ -13,24 +13,25 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid:
false, assembly: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: false,
merge: false],
[
file("$baseDir/tests/data/reads/campy-staph1.fq.gz"),
file("$baseDir/tests/data/reads/campy-staph2.fq.gz")
],
"illumina"
]
])
input[1] = "illumina"
"""
}

params {
outdir = "results"

min_reads = 1
mash_sketch = "https://github.com/phac-nml/mikrokondo/raw/dev/tests/data/databases/campy-staph-ecoli.msh"
mh_min_kmer = 1

@@ -46,8 +46,24 @@ nextflow_workflow {
}

then {

assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/MashSketches/SAMPlE1.mash.estimate.msh").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/MashScreen/SAMPlE1.mash.screen.reads.screen.screen").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").linesGzip.size() == 496
snapshot(workflow.out).match()

}

}
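
The `linesGzip.size() == 496` assertions above count raw lines in a gzipped FASTQ; since every FASTQ record spans exactly 4 lines (header, sequence, separator, qualities), 496 lines corresponds to 124 reads. A small Python check of that arithmetic (illustrative; the test file path is a temporary file, not a pipeline output):

```python
# Build a toy gzipped FASTQ with 124 records and verify that counting
# its lines reproduces the 496 expected by the snapshot assertions.
import gzip
import tempfile

def count_gzip_lines(path):
    """Count lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh)

record = "@read\nACGT\n+\nIIII\n"  # one 4-line FASTQ record
with tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt") as fh:
    fh.write(record * 124)

assert count_gzip_lines(path) == 496
assert count_gzip_lines(path) // 4 == 124  # lines / 4 = read count
```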
@@ -61,23 +78,24 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid:
false, assembly: false,
[id: "SAMPlE1",
hybrid: false,
assembly: false,
sample: "SAMPLE1",
downsampled: false,
single_end: true,
merge: false],
[
file("$baseDir/tests/data/reads/campy-staph1.fq.gz"),
],
"nanopore"
]
])
input[1] = "nanopore"
"""
}

params {
outdir = "results"

min_reads = 1
mash_sketch = "https://github.com/phac-nml/mikrokondo/raw/dev/tests/data/databases/campy-staph-ecoli.msh"
mh_min_kmer = 1

@@ -94,7 +112,18 @@ nextflow_workflow {

then {
assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/MashSketches/SAMPlE1.mash.estimate.msh").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/MashScreen/SAMPlE1.mash.screen.reads.screen.screen").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").linesGzip.size() == 500
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").linesGzip.size() == 500
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.trimmed.reads.fastq.gz").linesGzip.size() == 500
snapshot(workflow.out).match()

}

}
@@ -108,7 +137,9 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: true,
@@ -139,9 +170,19 @@ nextflow_workflow {
}

then {
assert workflow.success
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").exists()
snapshot(workflow.out).match()
assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.sampled.reads.fastq.gz").linesGzip.size() == 5656
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").linesGzip.size() == 5656
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").linesGzip.size() == 16680
snapshot(workflow.out).match()

}

}
@@ -154,7 +195,9 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: false,
@@ -187,8 +230,25 @@ nextflow_workflow {

then {
assert workflow.success
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R2.final.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R1.seqtk.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R2.seqtk.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R1.final.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R2.final.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R1.seqtk.sample.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R2.seqtk.sample.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").linesGzip.size() == 16680
snapshot(workflow.out).match()
}
}
