Improvements in forked repository: KmerFinder, Estimation of Reference Genome, Custom Quast, and Custom MultiQC Reports #109

Closed
Daniel-VM opened this issue Jan 4, 2024 · 3 comments

@Daniel-VM
Contributor

Description of feature

Overview

Hello! My colleagues and I have been working on extending the nf-core/bacass workflow to address lab-specific challenges in bacterial genome assembly. We would be happy to contribute these improvements to the main nf-core/bacass repository if you are interested.

These enhancements are currently implemented in my fork of nf-core/bacass, on the buisciii-develop branch, and can be run as follows:

nextflow run main.nf \
        -profile singularity,test \
        --skip_kmerfinder false \
        --kmerfinderdb path/to/kmerfinder_db/bacteria \
        --ncbi_assembly_metadata path/to/ncbi_assembly_metadata/assembly_summary_bacteria.txt \
        --outdir ./results \
        -w ./work \
        -resume

Breakdown of the implementation:

1. KmerFinder Subworkflow:

  • Added a local KmerFinder module for read quality control (QC) and purity assessment.
  • Developed a local module to compile KmerFinder results from all samples into a comprehensive CSV summary file.
  • Implemented a method to group input samples (*.fastq, *.fasta, and other files...) based on the reference genome estimated with KmerFinder.
  • Created a local module to identify the reference genome estimated with KmerFinder in the NCBI database and download this genome. This reference genome is then utilized to retrieve relevant metrics from QUAST, such as the percentage of genome fraction. This functionality is particularly valuable when input samples belong to different species, requiring more than one reference for a comprehensive by_reference_genome report.
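As a rough illustration of the grouping and reference lookup described in this list (not the fork's actual code), a minimal Python sketch could look like the following. The function names are hypothetical, and the input is assumed to be a simple mapping from sample name to the accession of its top KmerFinder hit. The lookup assumes an assembly-summary-style TSV with `assembly_accession` and `ftp_path` columns, and that the genome FASTA sits under `ftp_path` with a `_genomic.fna.gz` suffix; the accession and URL in the demo are made up.

```python
import io
from collections import defaultdict


def group_by_reference(top_hits):
    """Map each estimated reference accession to the sorted samples sharing it.

    `top_hits` is a hypothetical {sample: accession} mapping.
    """
    groups = defaultdict(list)
    for sample, accession in top_hits.items():
        groups[accession].append(sample)
    return {accession: sorted(samples) for accession, samples in groups.items()}


def load_ftp_paths(handle):
    """Read an assembly-summary-style TSV and return {assembly_accession: ftp_path}."""
    header = None
    paths = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the last comment line before the data holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        paths[row["assembly_accession"]] = row["ftp_path"]
    return paths


def genome_url(ftp_path):
    """Build the genomic FASTA URL (assumed layout: <ftp_path>/<dirname>_genomic.fna.gz)."""
    ftp_path = ftp_path.rstrip("/")
    dirname = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{dirname}_genomic.fna.gz"


if __name__ == "__main__":
    # Made-up accession and URL, for illustration only.
    summary = io.StringIO(
        "# illustrative comment line\n"
        "# assembly_accession\torganism_name\tftp_path\n"
        "GCF_000000001.1\tExample bacterium\thttps://example.org/GCF_000000001.1_ASM1v1\n"
    )
    paths = load_ftp_paths(summary)
    groups = group_by_reference({"sampleA": "GCF_000000001.1", "sampleB": "GCF_000000001.1"})
    for accession, samples in groups.items():
        print(accession, samples, genome_url(paths[accession]))
```

Grouping first and downloading once per group means each reference genome is fetched a single time, however many samples map to it.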

2. Quast Assembly QC by Grouping Samples:

  • Modified Quast execution when KmerFinder is invoked. Quast now runs twice:
    • First, a 'general' Quast run without reference genome files (*.fna, *.gff).
    • Then, a 'by reference genome' Quast run, which produces a report aggregating the samples that share each reference genome (estimated with KmerFinder).
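The two-pass scheme can be sketched as a small Python helper that only builds the command lines, without running anything. This is an illustration under assumptions rather than the module's real logic: the function name, the output directory layout, and the `quast.py` executable name are placeholders, while `-o`, `-r` and `-g` are the QUAST options for the output directory, the reference genome and the feature annotations.

```python
def quast_commands(assemblies_by_reference, reference_files, outdir="quast"):
    """Return argv lists: one reference-free run, then one run per reference group.

    assemblies_by_reference: {accession: [assembly FASTA paths]}
    reference_files: {accession: (fna_path, gff_path or None)}
    """
    all_assemblies = sorted(
        path for group in assemblies_by_reference.values() for path in group
    )
    # Pass 1: 'general' run over every assembly, with no reference files.
    commands = [["quast.py", "-o", f"{outdir}/global", *all_assemblies]]

    # Pass 2: one run per estimated reference genome.
    for accession in sorted(assemblies_by_reference):
        fna, gff = reference_files[accession]
        cmd = ["quast.py", "-o", f"{outdir}/{accession}", "-r", fna]
        if gff:
            cmd += ["-g", gff]
        cmd += sorted(assemblies_by_reference[accession])
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in quast_commands(
        {"refA": ["s1.fasta", "s2.fasta"], "refB": ["s3.fasta"]},
        {"refA": ("refA.fna", "refA.gff"), "refB": ("refB.fna", None)},
    ):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run`, which keeps the grouping logic separate from execution.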

3. Custom MultiQC Reports:

  • Incorporated a custom MultiQC module into the workflow.
  • Added multiqc_config.yml files for short, long and hybrid assembly modes (these apply when KmerFinder is invoked; otherwise the standard MultiQC report is generated).
  • Upon invoking KmerFinder, a custom MultiQC HTML report is generated using the MULTIQC() module. This report consolidates metrics from KmerFinder, Quast, and other relevant sources, presenting them together in an overview table located in the first section of the report. See image:

[Image: Screenshot from 2024-01-04 16-23-26]

Footnote

If you think these improvements could be implemented in nf-core/bacass, let me know so I can work on the test data and test profile.

Daniel-VM added the "enhancement" (New feature or request) label Jan 4, 2024
Daniel-VM changed the title from "Request for forked repo enhancements: KmerFinder, Estimation of Reference Genome, Custom Quast, and Custom MultiQC Reports" to "Improvements in forked repository: KmerFinder, Estimation of Reference Genome, Custom Quast, and Custom MultiQC Reports" Jan 4, 2024
@d4straub
Collaborator

d4straub commented Jan 8, 2024

This actually looks very interesting to me. However, I have no experience with those tools and have not run the branch. For now, I trust your judgment that this is indeed good practice.

I browsed through it and noticed some minor points, e.g. a skip_* param (such as skip_kmerfinder) shouldn't default to true, I think. I am not very familiar with MultiQC, so I am not sure why a custom module is needed; I thought MultiQC could be configured to do almost anything, but I assume you have your reasons.

So yes, I think that would be nice to add into the nf-core repo.

@Daniel-VM
Contributor Author

Apologies for the late response.

Great, I still have to fix some minor points in the code and run additional tests. Once it's ready, I will let you know.

Regarding the custom MultiQC table at the beginning of the report: it plays a role similar to the default "General Stats" table. However, the custom table can include metrics from tools both with and without MultiQC support (in fact, any data you wish). It contains most of the key metrics that my group needs to check in bacterial genome assembly analysis. Since not all the tools we use have a supported MultiQC module, we gather their metrics and consolidate them in a single table. The pipeline also exports this table in CSV format.
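To make the idea concrete, here is a minimal Python sketch of such a consolidation, assuming each tool's metrics have already been parsed into a `{sample: {metric: value}}` dictionary. The function names, metric names, and the `id` and `section_name` values are hypothetical. The optional comment header follows what I understand to be MultiQC's custom-content convention for files named like `*_mqc.tsv`, so treat it as an assumption rather than a description of the fork's implementation.

```python
import csv
import io


def merge_metrics(*per_tool):
    """Merge {sample: {metric: value}} dicts from several tools into one dict per sample."""
    merged = {}
    for tool_metrics in per_tool:
        for sample, metrics in tool_metrics.items():
            merged.setdefault(sample, {}).update(metrics)
    return merged


def write_table(merged, handle, delimiter=",", mqc_header=False):
    """Write one row per sample; metrics missing for a sample are left empty."""
    columns = sorted({name for metrics in merged.values() for name in metrics})
    if mqc_header:
        # Assumed MultiQC custom-content header; the values are placeholders.
        handle.write("# id: 'assembly_summary'\n")
        handle.write("# section_name: 'Assembly summary'\n")
        handle.write("# plot_type: 'table'\n")
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["Sample", *columns])
    for sample in sorted(merged):
        writer.writerow([sample, *(merged[sample].get(name, "") for name in columns)])


if __name__ == "__main__":
    # Made-up metrics, for illustration only.
    quast = {"s1": {"N50": 100}, "s2": {"N50": 5}}
    kmerfinder = {"s1": {"kmerfinder_species": "Example bacterium"}}
    buffer = io.StringIO()
    write_table(merge_metrics(quast, kmerfinder), buffer)
    print(buffer.getvalue(), end="")
```

The same merged dictionary can be written twice: once as plain CSV for export, and once tab-delimited with `mqc_header=True` for the report.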

@d4straub
Collaborator

Looking forward!

I see, such an overview table is certainly a good idea.
