Here you will find a collection of scripts to analyse raw long-read sequencing data generated with Oxford Nanopore Technologies (ONT) platforms. Each script runs as a PBS job on an HPC cluster, but it can be adapted to other environments (e.g. SLURM or a local machine).
- Script 1: 01_basecalling_dorado.pbs. This script performs barcode classification in-line with basecalling using Dorado with the most accurate basecalling model (which also takes the longest to run). The .fast5 file generated by the sequencer should be located in the directory `./00_data/00_raw`. Depending on the library preparation kit used, you might have to change the value of the option `--kit-name`. Then, the BAM file is split into one BAM file per barcode. This job is computationally expensive, so make sure you request enough resources (GPU).
- Script 2: 02_read_statistics.pbs. This script analyses the quality of the raw reads using NanoPlot (which also provides summary statistics) and FastQC.
- Script 3: 03_quality_filtering_porechop.pbs. This script finds and removes adapters from the reads using Porechop.
- Script 4: 04_rename_IDs.pbs. This script renames the headers of the FASTQ reads so that they are unique for downstream analyses. The header of each sequence is modified to include a suffix derived from the last `_` of the last field in the header line.
- Script 5: 05_quality_filtering_bbmap.pbs. This script performs quality filtering of the reads using BBMap. First, `reformat.sh` discards any sequences shorter than 250 bp; second, `bbduk.sh` trims both ends of the reads to a minimum quality of Q10 using the Phred algorithm.
- Script 6: 06_reads_statistics.pbs. This script analyses the quality of the filtered reads using NanoPlot and FastQC.
- Script 7: 07_assembly_flye.pbs. This script assembles the reads into contigs using Flye in its mode for high-quality reads (intended for use with `dorado basecaller sup` in script 1).
- Script 8: 08_assembly_polishing_medaka.pbs. This script generates consensus sequences for the assembled genomes using Medaka.
- Script 9: 09_quast.pbs. This script calculates the statistics of the assembled genomes using Quast.
- Script 10: 10_checkM.pbs. This script assesses the quality of the assembled draft genomes using CheckM.
- Script 11: 11_genome_coverage.pbs. This script calculates the genome coverage of the assembled draft genome by mapping the FASTQ reads used for assembly using [minimap2](https://github.com/lh3/minimap2) and SAMtools.
- Script 12: 12_reordering_genomes_mauve.pbs. This script reorders the draft contigs according to the reference genome using Mauve. This helps to identify global rearrangements based on the gene annotations produced in the next step.
- Script 13: 13_annotation_prokka.pbs. This script annotates the assembled genomes using Prokka. Annotations are first taken from a reference genome supplied with the parameter `--proteins`. Modify the command as desired.
- Script 14: 14_AMR_ABRicate.pbs. This script looks for antimicrobial resistance genes using all the databases in ABRicate.
- Script 15: 15_pangenome_roary.pbs. This script constructs the pangenome using Roary with the annotations from Prokka.
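
To illustrate the header renaming in script 4, the function below is a minimal sketch and not the code in `04_rename_IDs.pbs`. It assumes four-line FASTQ records (no wrapped sequences) and assumes that the suffix is the text after the last `_` in the last whitespace-separated field of the header. The function name `rename_fastq_ids` is hypothetical.

```shell
# Hypothetical helper: append a suffix to the read ID of every FASTQ record.
# Assumption: the suffix is the text after the last "_" in the last
# whitespace-separated field of the header line.
# Reads FASTQ on stdin and writes the renamed FASTQ to stdout.
rename_fastq_ids() {
    awk '
        NR % 4 == 1 {
            # Split the last header field on "_" and keep its final piece.
            n = split($NF, parts, "_")
            # Only rename when the last field actually contains a "_".
            if (n > 1) {
                # "&" stands for the matched read ID (first token of the line).
                sub(/^[^ \t]+/, "&_" parts[n])
            }
        }
        { print }
    '
}
```

It could be used as `rename_fastq_ids < reads.fastq > reads.renamed.fastq`, where both file names are placeholders. Sequence, separator, and quality lines pass through unchanged.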
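
The two filtering steps of script 5 can be sketched roughly as follows. This is an assumption about how the commands might look, not a copy of `05_quality_filtering_bbmap.pbs`: the helper names `run_step` and `filter_reads`, the `DRY_RUN` variable, and the output file names are all hypothetical. By default the commands are only printed, so the sketch can be inspected without BBMap installed.

```shell
# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
run_step() {
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Hypothetical wrapper around the two BBMap steps described for script 5.
# $1 = input FASTQ, $2 = prefix for the output files (both placeholders).
filter_reads() (
    in_fq=$1
    prefix=$2
    # Step 1: discard reads shorter than 250 bp.
    run_step reformat.sh in="$in_fq" out="${prefix}.len250.fastq" minlength=250
    # Step 2: quality-trim both read ends (right and left) to Q10.
    run_step bbduk.sh in="${prefix}.len250.fastq" out="${prefix}.filtered.fastq" qtrim=rl trimq=10
)
```

Calling `filter_reads reads.fastq sample` prints the two commands; setting `DRY_RUN=0` before the call would execute them, provided `reformat.sh` and `bbduk.sh` are on the `PATH`.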
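
For the coverage step in script 11, one possible way to summarise the mapping is to average the per-base depth for each contig. The function below is a sketch under that assumption; it may differ from what `11_genome_coverage.pbs` actually reports. It expects the three-column, tab-separated table (contig, position, depth) that `samtools depth` writes, and the name `mean_depth_per_contig` is hypothetical.

```shell
# Hypothetical helper: mean read depth per contig.
# Input (stdin): tab-separated lines of contig, position, depth.
# Output (stdout): contig name and mean depth, sorted by contig name.
mean_depth_per_contig() {
    awk -F'\t' '
        { sum[$1] += $3; n[$1]++ }                      # accumulate depth per contig
        END {
            for (c in sum) printf "%s\t%.2f\n", c, sum[c] / n[c]
        }
    ' | sort
}
```

A typical use, with placeholder file names, would be to map the reads with `minimap2 -ax map-ont assembly.fasta reads.fastq`, sort the alignments with `samtools sort`, and then pipe `samtools depth -a aln.sorted.bam` into `mean_depth_per_contig`. The `-a` option makes `samtools depth` report positions with zero depth as well, so uncovered regions lower the mean.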