Implementation of KmerFinder subworkflow, Custom Quast, and Custom MultiQC Reports #135

Merged: 62 commits, May 24, 2024
Commits
ebcd833
add kmerfinder for shortreads
Daniel-VM Nov 6, 2023
68a9c5d
add module kmerfinder summary report
Daniel-VM Nov 6, 2023
fd8a934
add module to find and download reference genome
Daniel-VM Nov 14, 2023
c81cc29
fix kmerfinder summary input
Daniel-VM Nov 14, 2023
f989ae3
update kmerfinder output file extension
Daniel-VM Nov 17, 2023
20be5f9
add kmerfinder refseqid to meta
Daniel-VM Nov 17, 2023
16b5068
fix url in kmerfinder donwload ref
Daniel-VM Nov 17, 2023
10f68ba
temporary commit
Daniel-VM Nov 17, 2023
a48518d
group assemblies by refseqid
Daniel-VM Nov 19, 2023
1e436ea
allow global quast and by-refseqid quast
Daniel-VM Nov 19, 2023
7969e11
move kmerfinder processing to subworkflow plus output refactoring
Daniel-VM Nov 21, 2023
ef85494
Merge branch 'bu-isciii_tmp' into buisciii-develop
Daniel-VM Nov 21, 2023
3e8d9f1
allow quast to standard and byrefseq data
Daniel-VM Nov 21, 2023
53cc7ec
add byrefseq quast reports to multiqc and patch quast
Daniel-VM Nov 22, 2023
7aff2f6
add byrefseq quast reports to multiqc and patch quast
Daniel-VM Nov 22, 2023
de8e877
update multiqc and append fastp metrics to assmebly metrics df
Daniel-VM Jan 2, 2024
7103bfa
add new method to complie kmerfinder results into multiqc report
Daniel-VM Jan 2, 2024
854782d
add long reads assembly metrics to custom multiqc
Daniel-VM Jan 2, 2024
6fb64b3
Merge branch 'buisciii-develop' of https://github.com/Daniel-VM/bacas…
Daniel-VM Jan 2, 2024
0f3970e
fix custom multiqc when kmerfinder is not invoked
Daniel-VM Jan 3, 2024
247f592
add custom multiqc for hybrid assembly
Daniel-VM Jan 4, 2024
df1403d
add file-check-exist and rename variables
Daniel-VM Jan 4, 2024
c42735e
update documentation and add save_trimmed option
Daniel-VM Jan 5, 2024
3357785
add fastp additional options and fix input sample path
Daniel-VM Jan 5, 2024
1847f39
allow module to emit tsv report
Daniel-VM Jan 15, 2024
b73e3f0
add kmerfinder for shortreads
Daniel-VM Nov 6, 2023
764a3d0
add module kmerfinder summary report
Daniel-VM Nov 6, 2023
0f07718
add module to find and download reference genome
Daniel-VM Nov 14, 2023
e072b2d
fix kmerfinder summary input
Daniel-VM Nov 14, 2023
8baded1
update kmerfinder output file extension
Daniel-VM Nov 17, 2023
9bdc67e
add kmerfinder refseqid to meta
Daniel-VM Nov 17, 2023
97c2866
fix url in kmerfinder donwload ref
Daniel-VM Nov 17, 2023
0724141
temporary commit
Daniel-VM Nov 17, 2023
071d334
group assemblies by refseqid
Daniel-VM Nov 19, 2023
7822c4b
allow global quast and by-refseqid quast
Daniel-VM Nov 19, 2023
149071f
move kmerfinder processing to subworkflow plus output refactoring
Daniel-VM Nov 21, 2023
6e7e0fa
allow quast to standard and byrefseq data
Daniel-VM Nov 21, 2023
a184e41
add byrefseq quast reports to multiqc and patch quast
Daniel-VM Nov 22, 2023
997fc91
update multiqc and append fastp metrics to assmebly metrics df
Daniel-VM Jan 2, 2024
07f6c39
add new method to complie kmerfinder results into multiqc report
Daniel-VM Jan 2, 2024
e0481e0
add long reads assembly metrics to custom multiqc
Daniel-VM Jan 2, 2024
04ca4b3
fix custom multiqc when kmerfinder is not invoked
Daniel-VM Jan 3, 2024
3c6297a
add custom multiqc for hybrid assembly
Daniel-VM Jan 4, 2024
660799d
add file-check-exist and rename variables
Daniel-VM Jan 4, 2024
aeaa20e
update documentation and add save_trimmed option
Daniel-VM Jan 5, 2024
95d965d
add fastp additional options and fix input sample path
Daniel-VM Jan 5, 2024
b33a37d
allow module to emit tsv report
Daniel-VM Jan 15, 2024
0703417
fix kmerFinder by narrowing down the reference genomes to a single wi…
Daniel-VM Jan 17, 2024
20f891e
Fix divergen branches when mergin remote origin/buisciii-develop into…
Daniel-VM Mar 19, 2024
923696d
fix uncompress method to parse kmerfinder db
Daniel-VM May 17, 2024
b626c5d
add new kmerfinderdb untar method and fix standalone py
Daniel-VM May 17, 2024
75faba6
fix step to prepare kmerfinderdb
Daniel-VM May 20, 2024
fd96b5a
add kmerfinder to pipeline tests
Daniel-VM May 20, 2024
0fca1d1
kmerfinder subworkflow cleaning
Daniel-VM May 20, 2024
11010c7
Merge branch 'buisciii-develop' into dev
Daniel-VM May 20, 2024
7778a12
remove unnecessary dependencies after merging branch
Daniel-VM May 20, 2024
820ca64
fix linting after mergin branch
Daniel-VM May 20, 2024
0410564
update CHANGLEOG in #135
Daniel-VM May 20, 2024
8967e55
add reviewer suggestions #135 pt.1
Daniel-VM May 23, 2024
7075f3d
fix multqc channels
Daniel-VM May 23, 2024
53629ce
add reviewer suggestions #135 pt.2
Daniel-VM May 23, 2024
7b66a5e
fix test_long_miniasm git CI test in #135
Daniel-VM May 23, 2024
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -7,8 +7,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Changed`

- [#135](https://github.com/nf-core/bacass/pull/135) Replaced nf-core MultiQC module with a custom MultiQC module.

### `Added`

- [#135](https://github.com/nf-core/bacass/pull/135) Implementation of KmerFinder subworkflow, Custom Quast, and Custom MultiQC Reports:

- Added KmerFinder subworkflow for read quality control, purity assessment, and sample grouping based on reference genome estimation.
- Enhanced Quast Assembly QC to run both general and reference genome-based analyses when KmerFinder is invoked.
- Implemented custom MultiQC module with multiqc_config.yml files for different assembly modes (short, long, hybrid).
- Generated custom MultiQC HTML report consolidating metrics from KmerFinder, Quast, and other relevant sources.

- [#133](https://github.com/nf-core/bacass/pull/133) Update nf-core/bacass to the new nf-core 2.14.1 `TEMPLATE`.

### `Fixed`
9 changes: 7 additions & 2 deletions README.md
@@ -29,11 +29,12 @@ On release, automated continuous integration tests run the pipeline on a full-si

### Short Read Assembly

This pipeline is primarily for bacterial assembly of next-generation sequencing reads. It can be used to quality trim your reads using [FastP](https://github.com/OpenGene/fastp) and performs basic sequencing QC using [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Afterwards, the pipeline performs read assembly using [Unicycler](https://github.com/rrwick/Unicycler). Contamination of the assembly is checked using [Kraken2](https://ccb.jhu.edu/software/kraken2/) to verify sample purity.
This pipeline is primarily for bacterial assembly of next-generation sequencing reads. It can be used to quality trim your reads using [FastP](https://github.com/OpenGene/fastp) and performs basic sequencing QC using [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Afterwards, the pipeline performs read assembly using [Unicycler](https://github.com/rrwick/Unicycler). Contamination of the assembly is checked using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kmerfinder](https://bitbucket.org/genomicepidemiology/kmerfinder/src/master/) to verify sample purity.

### Long Read Assembly

For users that only have Nanopore data, the pipeline quality trims these using [PoreChop](https://github.com/rrwick/Porechop) and assesses basic sequencing QC utilizing [NanoPlot](https://github.com/wdecoster/NanoPlot) and [PycoQC](https://github.com/a-slide/pycoQC).
For users that only have Nanopore data, the pipeline quality trims these using [PoreChop](https://github.com/rrwick/Porechop) and assesses basic sequencing QC utilizing [NanoPlot](https://github.com/wdecoster/NanoPlot) and [PycoQC](https://github.com/a-slide/pycoQC). Contamination of the assembly is checked using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kmerfinder](https://bitbucket.org/genomicepidemiology/kmerfinder/src/master/) to verify sample purity.

The pipeline can then perform long read assembly utilizing [Unicycler](https://github.com/rrwick/Unicycler), [Miniasm](https://github.com/lh3/miniasm) in combination with [Racon](https://github.com/isovic/racon), [Canu](https://github.com/marbl/canu) or [Flye](https://github.com/fenderglass/Flye) by using the [Dragonflye](https://github.com/rpetit3/dragonflye)(\*) pipeline. Long reads assembly can be polished using [Medaka](https://github.com/nanoporetech/medaka) or [NanoPolish](https://github.com/jts/nanopolish) with Fast5 files.

> [!NOTE]
@@ -47,6 +48,10 @@ For users specifying both short read and long read (NanoPore) data, the pipeline

In all cases, the assembly is assessed using [QUAST](http://bioinf.spbau.ru/quast). The resulting bacterial assembly is furthermore annotated using [Prokka](https://github.com/tseemann/prokka), [Bakta](https://github.com/oschwengers/bakta) or [DFAST](https://github.com/nigyta/dfast_core).

If Kmerfinder is invoked, the pipeline groups samples by their [Kmerfinder](https://bitbucket.org/genomicepidemiology/kmerfinder/src/master/)-estimated reference genome. Two QUAST steps are then carried out: an initial ('general') [QUAST](http://bioinf.spbau.ru/quast) run of all samples without reference genomes, followed by a 'by reference genome' [QUAST](http://bioinf.spbau.ru/quast) run that aggregates samples with their reference genomes.

> NOTE: This scenario is only supported when [Kmerfinder](https://bitbucket.org/genomicepidemiology/kmerfinder/src/master/) analysis is performed.
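
The grouping step described above can be pictured with a small sketch. This is not the pipeline's Nextflow code: the function name `group_by_reference`, the sample names, and the reference IDs are hypothetical and only illustrate how samples that share a Kmerfinder best-hit reference could be collected before a 'by reference genome' QUAST run.

```python
from collections import defaultdict


def group_by_reference(best_hits):
    """Group sample names by their best-hit reference ID.

    best_hits: dict mapping sample name -> reference ID (assumed input shape).
    Returns a dict mapping reference ID -> sorted list of sample names.
    """
    groups = defaultdict(list)
    for sample, ref_id in best_hits.items():
        groups[ref_id].append(sample)
    # Sort each group so the result does not depend on input order.
    return {ref_id: sorted(samples) for ref_id, samples in groups.items()}


# Illustrative values only; not taken from the pipeline's test data.
best_hits = {"sampleA": "REF_1", "sampleB": "REF_2", "sampleC": "REF_1"}
groups = group_by_reference(best_hits)
print(groups)
```

Each resulting group would correspond to one reference-based QUAST invocation, while the 'general' run would take all samples together regardless of group.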
## Usage

> [!NOTE]
166 changes: 166 additions & 0 deletions assets/multiqc_config_hybrid.yml
@@ -0,0 +1,166 @@
report_comment: >
  This report has been generated by the <a href="https://github.com/nf-core/bacass/releases/tag/dev" target="_blank">nf-core/bacass</a>
  analysis pipeline. For information about how to interpret these results, please see the
  <a href="https://nf-co.re/bacass/dev/docs/output" target="_blank">documentation</a>.
data_format: "yaml"

max_table_rows: 10000

run_modules:
  - custom_content
  - fastqc
  - fastp
  - nanostat
  - porechop
  - pycoqc
  - kraken2
  - quast
  - prokka
  - bakta

exclude_modules:
  - general_stats

module_order:
  - fastqc:
      name: "PREPROCESS: FastQC (raw reads)"
      info: "This section of the report shows FastQC results for the raw reads before adapter trimming."
      path_filters:
        - "./fastqc/*.zip"
  - fastp:
      name: "PREPROCESS: fastp (adapter trimming)"
      info: "This section of the report shows fastp results for reads after adapter and quality trimming."
      path_filters:
        - "./fastp/*.json"
  - nanostat:
      name: "PREPROCESS: Nanoplot"
      info: "This section of the report shows Nanoplot results for nanopore sequencing data."
      path_filters:
        - "./nanoplot/*.txt"
  - porechop:
      name: "PREPROCESS: Porechop"
      info: "This section of the report shows Porechop results for reads after adapter trimming."
      path_filters:
        - "./porechop/*.log"
  - pycoqc:
      name: "PREPROCESS: PycoQC"
      info: "This section of the report shows PycoQC results for quality control of long-read sequencing data."
      path_filters:
        - "./pycoqc/*.txt"
  - kraken2:
      name: "CONTAMINATION ANALYSIS: Kraken 2"
      info: "This section of the report shows Kraken 2 classification results for reads after adapter trimming with fastp."
      path_filters:
        - ".*kraken2_*/*report.txt"
  - quast:
      name: "ASSEMBLY: Quast"
      info: "This section of the report shows Quast QC results for assembled genomes with Unicycler."
      path_filters:
        - "./quast/*/report.tsv"
  - prokka:
      name: "ANNOTATION: Prokka"
      info: "This section of the report shows Prokka annotation results for reads after adapter trimming and quality trimming."
      path_filters:
        - "./prokka/*.txt"
  - bakta:
      name: "ANNOTATION: Bakta"
      info: "This section of the report shows Bakta mapping and annotation results for reads after adapter trimming."
      path_filters:
        - "./bakta/*.txt"

report_section_order:
  fastqc:
    after: general_stats
  fastp:
    after: general_stats
  nanostat:
    after: general_stats
  porechop:
    before: nanostat
  kraken2:
    after: general_stats
  quast:
    after: general_stats
  prokka:
    before: nf-core-bacass-methods-description
  bakta:
    before: nf-core-bacass-methods-description
  nf-core-bacass-methods-description:
    order: -1000
  software_versions:
    order: -1001
  nf-core-bacass-summary:
    order: -1002

custom_data:
  summary_assembly_metrics:
    section_name: "De novo assembly metrics (short & long reads)"
    description: "generated by nf-core/bacass"
    plot_type: "table"
    headers:
      "Sample":
        description: "Input sample names"
        format: "{:,.0f}"
      "# Input short reads":
        description: "Total number of input reads in raw fastq files"
        format: "{:,.0f}"
      "# Trimmed short reads (fastp)":
        description: "Total number of reads remaining after adapter/quality trimming with fastp"
        format: "{:,.0f}"
      "# Input long reads":
        description: "Total number of input reads in raw fastq files"
        format: "{:,.0f}"
      "# Median long reads lenght":
        description: "Median read length (bp)"
        format: "{:,.0f}"
      "# Median long reads quality":
        description: "Median read quality (Phred scale)"
        format: "{:,.0f}"
      "# Contigs (hybrid assembly)":
        description: "Total number of contigs calculated by QUAST"
        format: "{:,.0f}"
      "# Largest contig (hybrid assembly)":
        description: "Size of largest contig calculated by QUAST"
        format: "{:,.0f}"
      "# N50 (hybrid assembly)":
        description: "N50 metric for de novo assembly as calculated by QUAST"
        format: "{:,.0f}"
      "# % Genome fraction (hybrid assembly)":
        description: "% genome fraction calculated by QUAST"
        format: "{:,.2f}"
      "# Best hit (Kmerfinder)":
        description: "Species name of the best hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Best hit assembly ID (Kmerfinder)":
        description: "Assembly ID of the best hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Best hit query coverage (Kmerfinder)":
        description: "Query coverage value of the best hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Best hit depth (Kmerfinder)":
        description: "Depth of the best hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Second hit (Kmerfinder)":
        description: "Species name of the second hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Second hit assembly ID (Kmerfinder)":
        description: "Assembly ID of the second hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Second hit query coverage (Kmerfinder)":
        description: "Query coverage value of the second hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"
      "# Second hit depth (Kmerfinder)":
        description: "Depth of the second hit from Kmerfinder (using short reads)"
        format: "{:,.0f}"

export_plots: true

# # Customise the module search patterns to speed up execution time
# # - Skip module sub-tools that we are not interested in
# # - Replace file-content searching with filename pattern searching
# # - Don't add anything that is the same as the MultiQC default
# # See https://multiqc.info/docs/#optimise-file-search-patterns for details
sp:
  fastp:
    fn: "*.fastp.json"
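
The `format` values in the `headers` block above are Python format strings. The sketch below is only an illustration, not part of the pipeline: it shows what `{:,.0f}` and `{:,.2f}` produce for numeric values, and that plain Python raises an error when the same format is applied to text such as a species name. How MultiQC itself handles text cells that carry a numeric format is not shown in this diff, so the fallback in the hypothetical helper `render` is an assumption, and the values are made up.

```python
def render(value, fmt):
    """Apply a Python format string; fall back to str(value) if it does not apply."""
    try:
        return fmt.format(value)
    except (ValueError, TypeError):
        # A numeric format such as "{:,.0f}" cannot be applied to text.
        return str(value)


# Made-up values for illustration.
contig_length = render(4641652, "{:,.0f}")        # thousands separator, no decimals
genome_fraction = render(98.7654, "{:,.2f}")      # two decimal places
best_hit = render("Escherichia coli", "{:,.0f}")  # text falls back unchanged

print(contig_length, genome_fraction, best_hit)
```

Text columns such as "Sample" or "# Best hit (Kmerfinder)" are declared with a numeric format in this config, so it may be worth confirming in a rendered report that they display as expected.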
140 changes: 140 additions & 0 deletions assets/multiqc_config_long.yml
@@ -0,0 +1,140 @@
report_comment: >
  This report has been generated by the <a href="https://github.com/nf-core/bacass/releases/tag/dev" target="_blank">nf-core/bacass</a>
  analysis pipeline. For information about how to interpret these results, please see the
  <a href="https://nf-co.re/bacass/dev/docs/output" target="_blank">documentation</a>.
data_format: "yaml"

max_table_rows: 10000

run_modules:
  - custom_content
  - nanostat
  - porechop
  - pycoqc
  - kraken2
  - quast
  - prokka
  - bakta

exclude_modules:
  - general_stats

module_order:
  - nanostat:
      name: "PREPROCESS: Nanoplot"
      info: "This section of the report shows Nanoplot results for nanopore sequencing data."
      path_filters:
        - "./nanoplot/*.txt"
  - porechop:
      name: "PREPROCESS: Porechop"
      info: "This section of the report shows Porechop results for reads after adapter trimming."
      path_filters:
        - "./porechop/*.log"
  - pycoqc:
      name: "PREPROCESS: PycoQC"
      info: "This section of the report shows PycoQC results for quality control of long-read sequencing data."
      path_filters:
        - "./pycoqc/*.txt"
  - kraken2:
      name: "CONTAMINATION ANALYSIS: Kraken 2"
      info: "This section of the report shows Kraken 2 classification results for reads after adapter trimming with fastp."
      path_filters:
        - ".*kraken2_*/*report.txt"
  - quast:
      name: "ASSEMBLY: Quast"
      info: "This section of the report shows Quast QC results for assembled genomes with Unicycler."
      path_filters:
        - "./quast/*/report.tsv"
  - prokka:
      name: "ANNOTATION: Prokka"
      info: "This section of the report shows Prokka annotation results for reads after adapter trimming and quality trimming."
      path_filters:
        - "./prokka/*.txt"
  - bakta:
      name: "ANNOTATION: Bakta"
      info: "This section of the report shows Bakta mapping and annotation results for reads after adapter trimming."
      path_filters:
        - "./bakta/*.txt"

report_section_order:
  nanostat:
    after: general_stats
  porechop:
    before: nanostat
  kraken2:
    after: general_stats
  quast:
    after: general_stats
  prokka:
    before: nf-core-bacass-methods-description
  bakta:
    before: nf-core-bacass-methods-description
  nf-core-bacass-methods-description:
    order: -1000
  software_versions:
    order: -1001
  nf-core-bacass-summary:
    order: -1002

custom_data:
  summary_assembly_metrics:
    section_name: "De novo assembly metrics (long-reads)"
    description: "generated by nf-core/bacass"
    plot_type: "table"
    headers:
      "Sample":
        description: "Input sample names"
        format: "{:,.0f}"
      "# Input reads":
        description: "Total number of input reads in raw fastq files"
        format: "{:,.0f}"
      "# Median read lenght":
        description: "Median read length (bp)"
        format: "{:,.0f}"
      "# Median read quality":
        description: "Median read quality (Phred scale)"
        format: "{:,.0f}"
      "# Contigs":
        description: "Total number of contigs calculated by QUAST"
        format: "{:,.0f}"
      "# Largest contig":
        description: "Size of largest contig calculated by QUAST"
        format: "{:,.0f}"
      "# N50":
        description: "N50 metric for de novo assembly as calculated by QUAST"
        format: "{:,.0f}"
      "# % Genome fraction":
        description: "% genome fraction calculated by QUAST"
        format: "{:,.2f}"
      "# Best hit (Kmerfinder)":
        description: "Species name of the best hit from Kmerfinder"
        format: "{:,.0f}"
      "# Best hit assembly ID (Kmerfinder)":
        description: "Assembly ID of the best hit from Kmerfinder"
        format: "{:,.0f}"
      "# Best hit query coverage (Kmerfinder)":
        description: "Query coverage value of the best hit from Kmerfinder"
        format: "{:,.0f}"
      "# Best hit depth (Kmerfinder)":
        description: "Depth of the best hit from Kmerfinder"
        format: "{:,.0f}"
      "# Second hit (Kmerfinder)":
        description: "Species name of the second hit from Kmerfinder"
        format: "{:,.0f}"
      "# Second hit assembly ID (Kmerfinder)":
        description: "Assembly ID of the second hit from Kmerfinder"
        format: "{:,.0f}"
      "# Second hit query coverage (Kmerfinder)":
        description: "Query coverage value of the second hit from Kmerfinder"
        format: "{:,.0f}"
      "# Second hit depth (Kmerfinder)":
        description: "Depth of the second hit from Kmerfinder"
        format: "{:,.0f}"

export_plots: true
# # Customise the module search patterns to speed up execution time
# # - Skip module sub-tools that we are not interested in
# # - Replace file-content searching with filename pattern searching
# # - Don't add anything that is the same as the MultiQC default
# # See https://multiqc.info/docs/#optimise-file-search-patterns for details