updated snapshots and test output checks
mattheww95 committed Oct 2, 2024
1 parent 020f327 commit 5da1919
Showing 5 changed files with 311 additions and 50 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Changed`

- Added RASUSA for downsampling of nanopore or pacbio data.
- Added RASUSA for downsampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

### `Updated`

20 changes: 10 additions & 10 deletions docs/usage/tool_params.md
@@ -15,15 +15,15 @@ Screens contigs for antimicrobial and virulence genes. If you wish to use a diff
- singularity: Abricate singularity container
- docker: Abricate docker container
- **args**: Can be a string of additional command line arguments to pass to abricate
- report_tag: determines the name of the Abricate output in the final summary file. **Do no touch this unless doing pipeline development.**
- header_p: This field tells the report module that the Abricate output contains headers. **Do no touch this unless doing pipeline development.**
- report_tag: determines the name of the Abricate output in the final summary file. **Do not change this unless doing pipeline development.**
- header_p: This field tells the report module that the Abricate output contains headers. **Do not change this unless doing pipeline development.**

### Raw Read Metrics
A custom Python script that gathers quality metrics for each fastq file.

- raw_reads
- high_precision: When set to true, floating point precision of values output are accurate down to very small decimal places. Recommended to leave this setting as false (use the standard floats), it is much faster and having such precise decimal places is not needed for this module.
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do no touch this unless doing pipeline development.**
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do not change this unless doing pipeline development.**
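
The `high_precision` trade-off can be illustrated with a small Python sketch (this is not mikrokondo's actual metrics script; the function names are invented). Standard floats already agree with arbitrary-precision decimals far beyond any biologically meaningful number of decimal places, which is why leaving the option off is recommended:

```python
# Sketch: mean Phred quality of a read computed with ordinary floats
# versus decimal.Decimal, illustrating the "high_precision" option.
from decimal import Decimal, getcontext

def mean_quality_float(phred_scores):
    """Mean Phred score using ordinary floats (fast, ~15-17 significant digits)."""
    return sum(phred_scores) / len(phred_scores)

def mean_quality_decimal(phred_scores, digits=50):
    """Mean Phred score using arbitrary-precision decimals (much slower)."""
    getcontext().prec = digits
    return sum(Decimal(q) for q in phred_scores) / Decimal(len(phred_scores))

scores = [30, 31, 28, 37, 40, 22]
fast = mean_quality_float(scores)
precise = mean_quality_decimal(scores)
# The two results agree to well below any meaningful precision.
assert abs(fast - float(precise)) < 1e-9
```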

### Coreutils
In cases where a process uses bash scripting only, Nextflow by default will utilize system binaries when they are available and no container is specified. For reproducibility, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via below commands:
@@ -47,12 +47,12 @@ Kat was previously used to estimate genome size, however at the time of writing
Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo. The usage of seqtk to convert a fasta to a fastq is needed in certain typing tools requiring reads as input (this was a design decision to keep the pipeline generalizable).

- seqtk
- singularity: singularity container for seqtk
- docker: docker container for seqtk
- singularity: Singularity container for seqtk
- docker: Docker container for seqtk
- seed: A seed value for sub-sampling
- reads_ext: Extension of reads after sub-sampling, do not touch alter this unless doing pipeline development.
- assembly_fastq: Extension of the fastas after being converted to fastq files. Do no touch this unless doing pipeline development.
- report_tag: Name of seqtk data in the fi nal summary report. Do no touch this unless doing pipeline development.
- assembly_fastq: Extension of the fastas after being converted to fastq files. Do not change this unless doing pipeline development.
- report_tag: Name of seqtk data in the final summary report. Do not change this unless doing pipeline development.
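
The role of the `seed` parameter can be sketched in Python (an analogy to seeded sampling as in `seqtk sample -s`, not seqtk's implementation): a fixed seed makes the selected subset deterministic, so repeated runs reproduce the same reads.

```python
# Sketch of seeded down-sampling: the same seed always selects the
# same subset of reads, making sub-sampling reproducible.
import random

def subsample(read_ids, fraction, seed):
    """Return a deterministic pseudo-random subset of reads."""
    rng = random.Random(seed)
    return [r for r in read_ids if rng.random() < fraction]

reads = [f"read_{i}" for i in range(1000)]
a = subsample(reads, 0.1, seed=42)
b = subsample(reads, 0.1, seed=42)
assert a == b             # identical seed -> identical subset
assert 50 < len(a) < 150  # roughly 10% of reads retained
```

For paired-end data this determinism is what keeps R1 and R2 files in sync: sampling both with the same seed selects the same read pairs.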

### Rasusa
For long read data Rasusa is used for down sampling as it take read length into consideration when down sampling.
@@ -61,16 +61,16 @@
- singularity: singularity container for rasusa
- docker: docker container for rasusa
- seed: A seed value for sub-sampling
- reads_ext: The extension of the generated fastq files. Do no touch this unless doing pipeline development.
- reads_ext: The extension of the generated fastq files. Do not change this unless doing pipeline development.
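
The "takes read length into consideration" point can be sketched as follows (a simplified, non-random illustration, not Rasusa's actual algorithm, which selects reads randomly): reads are kept until their cumulative length reaches `target_depth * genome_size`, so coverage depth, not read count, drives the cut-off.

```python
# Sketch of length-aware down-sampling: stop keeping reads once the
# kept bases reach target_depth * genome_size.
def downsample_to_depth(read_lengths, genome_size, target_depth):
    target_bases = genome_size * target_depth
    kept, total = [], 0
    for i, length in enumerate(read_lengths):
        if total >= target_bases:
            break
        kept.append(i)
        total += length
    return kept

# 1 kb "genome", 10x target depth -> stop once ~10,000 bases are kept.
lengths = [1500, 3000, 2500, 4000, 2000, 1000]
kept = downsample_to_depth(lengths, genome_size=1000, target_depth=10)
assert len(kept) == 4  # 1500 + 3000 + 2500 + 4000 = 11000 >= 10000
```

This matters for long reads because a fixed read-count cut-off (as in simple sub-sampling) would give wildly different depths depending on read length.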

### FastP
Fastp is fast and widely used program for gathering of read quality metrics, adapter trimming, read filtering and read trimming. FastP has extensive options for configuration which are detailed in their documentation, but sensible defaults have been set. **Adapter trimming in Fastp is performed using overlap analysis, however if you do not trust this you can specify the sequencing adapters used directly in the additional arguments for Fastp**.

- fastp
- singularity: singularity container for FastP
- docker: docker container for FastP
- fastq_ext: extension of the output Fastp trimmed reads, do not touch this unless doing pipeline development.
- html_ext: Extension of the html report output by fastp, do no touch unless doing pipeline development.
- fastq_ext: extension of the output Fastp trimmed reads, Do not change this unless doing pipeline development.
- html_ext: Extension of the html report output by fastp, Do not touch unless doing pipeline development.
- json_ext: Extension of json report output by FastP do not touch unless doing pipeline development.
- report_tag: Title of FastP data in the summary report.
- **average_quality_e**: If a read/read-pair quality is less than this value it is discarded. Can be set from the command line with `--fp_average_quality`.
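
The average-quality filter described above can be sketched in Python (an illustration of the idea; FastP's exact computation may differ): decode the Phred+33 quality characters, average them, and discard the read if the mean falls below the threshold.

```python
# Sketch of a mean-quality read filter over Phred+33 quality strings.
def mean_phred(quality_string):
    """Arithmetic mean of Phred scores encoded as ASCII (offset 33)."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

def passes_filter(quality_string, min_average=25):
    """Keep the read only if its mean quality meets the threshold."""
    return mean_phred(quality_string) >= min_average

assert passes_filter("IIIIIIII")      # 'I' decodes to Q40
assert not passes_filter("$$$$$$$$")  # '$' decodes to Q3
```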
7 changes: 3 additions & 4 deletions subworkflows/local/clean_reads.nf
@@ -135,9 +135,9 @@ workflow QC_READS {
log.info "Not down sampling ${it[0].id} as estimated sample depth is already below targeted depth of ${params.target_depth}."
}

to_down_sample = reads_sample.sub_sample.branch { meta, reads, sample_frac ->
short_reads: !meta.single_end // Hybrid and short reads sets still go to seqtk
long_reads: meta.single_end
to_down_sample = reads_sample.sub_sample.branch { it ->
short_reads: !it[0].single_end
long_reads: true
}

// Short reads and hybrid reads sets get sampled with seqtk still.
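
The refactored `branch` above relies on Nextflow evaluating branch conditions in order and routing each item to the first condition that matches, so `long_reads: true` acts as a catch-all for anything that is not a paired short-read set. A Python analog of that routing (illustrative only, not the workflow's code):

```python
# Python analog of an ordered two-way branch: first matching condition
# wins, and a final catch-all takes everything else.
def branch_reads(items):
    short_reads, long_reads = [], []
    for item in items:
        meta = item[0]
        if not meta["single_end"]:    # paired short/hybrid read sets
            short_reads.append(item)
        else:                         # catch-all, like `long_reads: true`
            long_reads.append(item)
    return short_reads, long_reads

items = [({"single_end": False}, "R1+R2"), ({"single_end": True}, "nanopore")]
short, long_ = branch_reads(items)
assert len(short) == 1 and len(long_) == 1
```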
@@ -178,7 +178,6 @@ workflow QC_READS {
}

mash_screen_out = MASH_SCREEN(ch_prepped_reads, params.mash.mash_sketch ? file(params.mash.mash_sketch) : error("--mash_sketch ${params.mash_sketch} is invalid"))

versions = versions.mix(mash_screen_out.versions)

// Determine if sample is metagenomic
90 changes: 75 additions & 15 deletions tests/subworkflows/local/clean_reads/clean_reads.nf.test
@@ -13,24 +13,25 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid:
false, assembly: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: false,
merge: false],
[
file("$baseDir/tests/data/reads/campy-staph1.fq.gz"),
file("$baseDir/tests/data/reads/campy-staph2.fq.gz")
],
"illumina"
]
])
input[1] = "illumina"
"""
}

params {
outdir = "results"

min_reads = 1
mash_sketch = "https://github.com/phac-nml/mikrokondo/raw/dev/tests/data/databases/campy-staph-ecoli.msh"
mh_min_kmer = 1

@@ -46,8 +46,24 @@ nextflow_workflow {
}

then {

assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/MashSketches/SAMPlE1.mash.estimate.msh").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/MashScreen/SAMPlE1.mash.screen.reads.screen.screen").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").linesGzip.size() == 496
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").linesGzip.size() == 496
snapshot(workflow.out).match()

}

}
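
The `linesGzip.size() == 496` assertions above count raw lines in a gzipped FASTQ; since every FASTQ record spans exactly 4 lines (header, sequence, separator, qualities), 496 lines corresponds to 124 reads. A small Python check of that arithmetic (illustrative; the test file path is a temporary file, not a pipeline output):

```python
# Build a toy gzipped FASTQ with 124 records and verify that counting
# its lines reproduces the 496 expected by the snapshot assertions.
import gzip
import tempfile

def count_gzip_lines(path):
    """Count lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh)

record = "@read\nACGT\n+\nIIII\n"  # one 4-line FASTQ record
with tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt") as fh:
    fh.write(record * 124)

assert count_gzip_lines(path) == 496
assert count_gzip_lines(path) // 4 == 124  # lines / 4 = read count
```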
@@ -61,23 +78,24 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid:
false, assembly: false,
[id: "SAMPlE1",
hybrid: false,
assembly: false,
sample: "SAMPLE1",
downsampled: false,
single_end: true,
merge: false],
[
file("$baseDir/tests/data/reads/campy-staph1.fq.gz"),
],
"nanopore"
]
])
input[1] = "nanopore"
"""
}

params {
outdir = "results"

min_reads = 1
mash_sketch = "https://github.com/phac-nml/mikrokondo/raw/dev/tests/data/databases/campy-staph-ecoli.msh"
mh_min_kmer = 1

@@ -94,7 +112,18 @@ nextflow_workflow {

then {
assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/MashSketches/SAMPlE1.mash.estimate.msh").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/MashScreen/SAMPlE1.mash.screen.reads.screen.screen").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").linesGzip.size() == 500
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").linesGzip.size() == 500
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.trimmed.reads.fastq.gz").linesGzip.size() == 500
snapshot(workflow.out).match()

}

}
@@ -108,7 +137,9 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: true,
@@ -139,9 +170,19 @@ nextflow_workflow {
}

then {
assert workflow.success
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").exists()
snapshot(workflow.out).match()
assert workflow.success
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.final.sampled.reads.fastq.gz").linesGzip.size() == 5656
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/Rasusa/SAMPlE1.rasusa.sample.sampled.reads.fastq.gz").linesGzip.size() == 5656
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.trimmed.reads.fastq.gz").linesGzip.size() == 16680
snapshot(workflow.out).match()

}

}
@@ -154,7 +195,9 @@ nextflow_workflow {
"""
input[0] = Channel.of(
[
[id: "SAMPlE1", hybrid: false,
[id: "SAMPlE1",
hybrid: false,
sample: "SAMPLE1",
assembly: false,
downsampled: false,
single_end: false,
@@ -187,8 +230,25 @@ nextflow_workflow {

then {
assert workflow.success
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.html").exists()
assert path("${launchDir}/results/Reads/Quality/Trimmed/FastP/SAMPlE1.fastp.summary.json").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R2.final.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R1.seqtk.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R2.seqtk.sample.sampled.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").exists()
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R1.final.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/FinalReads/SAMPLE1/SAMPlE1.SAMPlE1_R2.final.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R1.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/SAMPlE1.deconned.R2.deconned.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R1.seqtk.sample.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/DownSampled/SeqTK/SAMPlE1.SAMPlE1_R2.seqtk.sample.sampled.reads.fastq.gz").linesGzip.size() == 4860
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R1.trimmed.reads.fastq.gz").linesGzip.size() == 16680
assert path("${launchDir}/results/Reads/Processing/Dehosting/Trimmed/FastP/SAMPlE1.fastp.R2.trimmed.reads.fastq.gz").linesGzip.size() == 16680
snapshot(workflow.out).match()
}
}
