MultiPolish-Dragonflye-Evaluation
This project includes a set of scripts for running the Dragonflye pipeline with various combinations of polishing models on Illumina and Nanopore reads. The goal is to systematically evaluate how different polishing model combinations affect assembly quality and completeness, giving researchers a flexible toolkit for optimizing genome assemblies.
Acknowledgments: We express our gratitude to the authors and developers of the tools utilized in this project:
- Dragonflye: [@rpetit3] https://github.com/rpetit3/dragonflye
- Racon: [@isovic] https://github.com/isovic/racon
- Medaka: [@nanoporetech] https://github.com/nanoporetech/medaka
- Pilon: [@broadinstitute] https://github.com/broadinstitute/pilon
- Polypolish: [@rrwick] https://github.com/rrwick/Polypolish
- QUAST: [@ablab] https://github.com/ablab/quast
- BUSCO: https://gitlab.com/ezlab/busco
Standard Operating Procedure (SOP) for Evaluating Assemblies with Different Polishing Model Combinations
Rationale: The purpose of this analysis is to systematically evaluate the impact of different polishing model combinations on the quality and completeness of genome assemblies generated by the Dragonflye pipeline. By employing various combinations of Racon, Medaka, Pilon, and Polypolish, we aim to identify the most effective strategy for enhancing assembly accuracy, contiguity, and overall quality. This analysis is crucial for optimizing the bioinformatics workflow, ensuring the robustness of genome assemblies, and providing valuable insights for future genomic studies.
Objective: To systematically evaluate and compare multiple assemblies generated by the Dragonflye pipeline using different polishing model combinations and report various metrics for each assembly.
Equipment and Software:
- Dragonflye pipeline (version 1.1.1)
- Racon (version 1.5.0)
- Medaka (version 1.8.0)
- Pilon (version 1.24)
- Polypolish (version 0.5.0)
- QUAST (version 5.0.2)
- BUSCO (version 5.5.0)
- Computational resources as required
Polishing Model Combinations:
- Racon + Medaka + Pilon + Polypolish (R+M+Pi+Po)
- Racon + Polypolish (R+Po)
- Racon (R)
- Racon + Medaka + Pilon (R+M+Pi)
- Medaka + Pilon + Polypolish (M+Pi+Po)
- Racon + Medaka + Polypolish (R+M+Po)
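The combinations above can be expressed as data and turned into one Dragonflye command per combination. The following is a minimal sketch, not the contents of run_dragonflye.sh. It assumes that Dragonflye accepts `--racon`, `--medaka`, `--pilon`, and `--polypolish` as round counts, with `0` disabling a tool, and that `--reads`, `--R1`, `--R2`, and `--outdir` name the inputs and output. Check these against `dragonflye --help` for your installed version. The sample name, read file names, and output directory naming are hypothetical.

```python
import shlex

# Tools enabled in each combination, keyed by the short labels used above.
COMBINATIONS = {
    "R+M+Pi+Po": ["racon", "medaka", "pilon", "polypolish"],
    "R+Po": ["racon", "polypolish"],
    "R": ["racon"],
    "R+M+Pi": ["racon", "medaka", "pilon"],
    "M+Pi+Po": ["medaka", "pilon", "polypolish"],
    "R+M+Po": ["racon", "medaka", "polypolish"],
}
ALL_TOOLS = ["racon", "medaka", "pilon", "polypolish"]


def build_command(sample, label, reads, r1, r2):
    """Return an argument list for one sample and one polishing combination."""
    enabled = COMBINATIONS[label]
    # Hypothetical output naming: <sample>_<label with '+' replaced by '_'>.
    outdir = f"{sample}_{label.replace('+', '_')}"
    cmd = ["dragonflye", "--reads", reads, "--R1", r1, "--R2", r2,
           "--outdir", outdir]
    for tool in ALL_TOOLS:
        # Assumption: 1 round when the tool is enabled, 0 rounds to skip it.
        cmd += [f"--{tool}", "1" if tool in enabled else "0"]
    return cmd


# Print the six commands for a hypothetical sample so they can be reviewed
# before anything is executed.
for label in COMBINATIONS:
    print(shlex.join(build_command(
        "sample1", label, "ont.fastq.gz", "R1.fastq.gz", "R2.fastq.gz")))
```

Printing the commands instead of running them keeps the sketch safe to try. Each list could be passed to `subprocess.run` once the flags are confirmed.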
Procedure:
- Sample Preparation: Ensure a consistent set of reads for all polishing model combinations.
- Assembly Generation: Run the Dragonflye pipeline with the specified polishing model combinations.
- Polishing: Apply Racon, Medaka, Pilon, and Polypolish as per the selected polishing model combinations.
- Metrics Collection: For each assembly, use the provided sample sheet to report the following metrics:
  - Complete and single-copy BUSCOs (%)
  - GC content (%)
  - N50
  - Depth
  - Total number of base pairs
  - Average contig length
  - Number of contigs
  - Number of circular contigs
  - Average depth of contigs
- Data Analysis: Compare and analyze the metrics for each assembly to identify trends and variations in performance. Utilize graphical representations for a comprehensive analysis.
- Documentation: Maintain a detailed record of the results, including metrics and analysis outcomes. Document any deviations or issues encountered during the analysis.
- Reporting: Prepare a comprehensive summary report highlighting key findings, comparisons between polishing model combinations, and any notable observations. Include graphical representations to enhance clarity.
- Optimization Strategies: Based on the analysis, consider optimization strategies for improving assembly quality, if necessary.
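For reference, two of the metrics in the procedure above, N50 and GC content, have simple definitions that can be checked by hand. The sketch below is an independent sanity check and not how QUAST is implemented. It assumes contig lengths and sequences are already loaded into Python lists.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L cover at least
    half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def gc_percent(sequences):
    """GC content (%) over A/C/G/T bases, ignoring N and other symbols."""
    gc = at = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0
```

For contigs of lengths 2, 3, 4, 5, and 6 (20 bp in total), the two longest contigs together reach 11 bp, which is more than half, so the N50 is 5.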
Scripts
run_dragonflye.sh
This script automates the execution of the Dragonflye assembly pipeline with different polishing model combinations. It runs the individual commands and organizes the output directories.
Usage
- Ensure Dragonflye and its dependencies are installed.
- Create a file named ids containing the list of sample IDs.
- Modify the script variables in run_dragonflye.sh as needed.
- Execute the script:
    ./run_dragonflye.sh
busco_summarise_results.py
This script captures the required metrics from the summary files generated by BUSCO. It parses the summary files and writes the results to a CSV file.
Usage
- Run BUSCO on your assemblies using the following general command:
    for f in */*.fa; do busco -i $f -o $f --out_path /path/to/output_directory/busco_out -m geno -c 45 -l /path/to/busco_lineages/; done
- Replace /path/to/output_directory/ with the desired output directory path.
- Replace /path/to/busco_lineages/ with the path to the BUSCO lineages directory.
- Execute the script:
    python busco_summarise_results.py /path/to/busco_summaries/
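To illustrate the kind of parsing involved, here is a minimal sketch of reading BUSCO short summaries into a CSV. It is not the code of busco_summarise_results.py. It assumes the one-line score string that BUSCO v5 writes into `short_summary*.txt` files (of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The function names and CSV column names are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumed score line, with illustrative numbers:
#   C:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124
SCORE_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Return the BUSCO scores found in a short summary, or None if absent."""
    match = SCORE_RE.search(text)
    if match is None:
        return None
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])  # n is a count, not a percentage
    return scores


def summarise(directory, out_csv):
    """Write one CSV row per short_summary*.txt file found under directory."""
    fields = ["file", "complete", "single", "duplicated",
              "fragmented", "missing", "total"]
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for path in sorted(Path(directory).rglob("short_summary*.txt")):
            scores = parse_busco_summary(path.read_text())
            if scores is not None:
                writer.writerow({"file": path.name, **scores})
```

The `single` column corresponds to the "Complete and single-copy BUSCOs (%)" metric listed in the SOP.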
extract_depth_from_logs.py
This script extracts the estimated sequencing depth and total number of base pairs from Dragonflye log files.
Usage
- Replace 'your_parent_directory' with the actual path to the parent directory containing subdirectories.
- Replace 'depth_out.csv' with the desired output CSV file name.
- Execute the script:
    python extract_depth_from_logs.py
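A log scan of this kind could look like the sketch below. The log line shapes and the `dragonflye.log` file name are assumptions made for illustration, so compare them with your own logs and adjust the patterns before relying on the output. This is not the code of extract_depth_from_logs.py.

```python
import csv
import re
from pathlib import Path

# Assumed (hypothetical) log line shapes:
#   ... Estimated sequencing depth: 87x
#   ... total_bp = 412345678
DEPTH_RE = re.compile(r"Estimated sequencing depth:\s*([\d.]+)\s*x", re.IGNORECASE)
TOTAL_BP_RE = re.compile(r"total_bp\s*=\s*(\d+)")


def parse_log(text):
    """Return (depth, total_bp); either value is None when not found."""
    depth = DEPTH_RE.search(text)
    total = TOTAL_BP_RE.search(text)
    return (float(depth.group(1)) if depth else None,
            int(total.group(1)) if total else None)


def collect(parent_dir, out_csv, log_name="dragonflye.log"):
    """Scan each subdirectory of parent_dir for log_name and write one
    CSV row per subdirectory."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "depth", "total_bp"])
        for log in sorted(Path(parent_dir).glob(f"*/{log_name}")):
            depth, total_bp = parse_log(log.read_text())
            writer.writerow([log.parent.name, depth, total_bp])
```

Calling `collect("your_parent_directory", "depth_out.csv")` would mirror the two values the usage notes ask you to set.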
calculate_contig_avg_depth.sh
This script extracts contig depth from an assembled FASTA file and calculates the average depth for all contigs.
Usage
    ./calculate_contig_avg_depth.sh <input_fasta_file>
Replace <input_fasta_file> with the path to your assembled FASTA file.
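The same calculation can be sketched in Python. This assumes each contig header carries a `cov=<number>` field, which should be verified against the headers in your own assemblies. It is an illustration, not a port of the shell script, and it computes an unweighted mean across contigs.

```python
import re

# Assumption: headers look roughly like
#   >contig00001 len=4641652 cov=52.0 ...
COV_RE = re.compile(r"\bcov=([\d.]+)")


def average_depth(fasta_lines):
    """Unweighted mean of the cov= values over all header lines.
    Returns None when no header carries a cov= field."""
    depths = []
    for line in fasta_lines:
        if line.startswith(">"):
            match = COV_RE.search(line)
            if match:
                depths.append(float(match.group(1)))
    return sum(depths) / len(depths) if depths else None
```

An open file handle can be passed directly as `fasta_lines`, since the function only iterates over lines.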
calculate_contig_avg_length.sh
This script extracts contig length from an assembled FASTA file and calculates the average length for all contigs.
Usage
    ./calculate_contig_avg_length.sh <input_fasta_file>
Replace <input_fasta_file> with the path to your assembled FASTA file.
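As a cross-check, contig lengths can also be measured from the sequences themselves instead of the header fields. The sketch below makes that choice deliberately, so it works for any FASTA file, including ones with wrapped sequence lines. It is not a port of the shell script.

```python
def contig_lengths(fasta_lines):
    """Return the length of each record, summing wrapped sequence lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # start a new record
        elif line and lengths:
            lengths[-1] += len(line)    # extend the current record
    return lengths


def average_length(fasta_lines):
    """Mean contig length, or 0.0 for an empty file."""
    lengths = contig_lengths(fasta_lines)
    return sum(lengths) / len(lengths) if lengths else 0.0
```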
extract_contig_stats.sh
This script extracts the number of contigs and the number of circular contigs from a FASTA file.
Usage
    ./extract_contig_stats.sh <input_fasta_file>
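Counting contigs amounts to counting header lines. Counting circular contigs depends on how circularity is recorded. The sketch below assumes a header field such as `circular=Y` or `circular=true`, which is a guess at the convention. Inspect your FASTA headers and adapt the pattern accordingly.

```python
import re

# Assumption: circular contigs are flagged in the header with
# circular=Y, circular=yes, or circular=true (case-insensitive).
CIRCULAR_RE = re.compile(r"\bcircular=(y|yes|true)\b", re.IGNORECASE)


def contig_stats(fasta_lines):
    """Return (number of contigs, number of contigs flagged as circular)."""
    headers = [line for line in fasta_lines if line.startswith(">")]
    circular = sum(1 for header in headers if CIRCULAR_RE.search(header))
    return len(headers), circular
```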
Directory Structure
- dragonflye_out_1: Output directory for Command 1
- dragonflye_out_2: Output directory for Command 2
- ...
Happy coding and genome crafting!