The data are also published on Zenodo.org. Please cite it if you use the Genome-Size_Annotation tool.
The scripts were written by Mr Arpit Mathur, who at the time of development was working in Dr Nikhil Patkar's lab at ACTREC, Tata Memorial Center, Navi Mumbai.
Scripts to Download Genome Size from NCBI Server
Metagenomic analysis often requires genome size for abundance estimation. For example, a normalized abundance is calculated from the depth of coverage, which is essentially the total number of reads assigned to a taxon divided by that taxon's sequence length.
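As a minimal sketch of the depth-of-coverage idea described above (reads assigned to a taxon divided by its genome size), the Python snippet below may help. The function name, the taxon names, and all numbers are hypothetical and chosen only for illustration; they do not come from our dataset or scripts.

```python
def depth_of_coverage(assigned_reads, genome_size_bp):
    """Reads assigned to a taxon divided by its genome size in base pairs."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of base pairs")
    return assigned_reads / genome_size_bp


# Hypothetical read counts and genome sizes, for illustration only.
taxa = {
    "taxon_A": {"reads": 5000, "genome_size_bp": 5_000_000},
    "taxon_B": {"reads": 5000, "genome_size_bp": 2_000_000},
}

for name, row in taxa.items():
    print(name, depth_of_coverage(row["reads"], row["genome_size_bp"]))
```

In this example both taxa receive the same number of reads, but the taxon with the smaller genome ends up with the higher normalized value, which is why the genome size is needed.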
Some pipelines, such as those based on bowtie2, already report genome size in their output. In many other cases, such as kraken/kraken2- and kaiju-based classification, the output lists only the reads assigned to each taxon and does not mention genome size. To use depth of coverage as a measure of abundance in those cases, we need the genome size.
One way is to calculate genome size from the .fasta or .fna files present in the taxonomy folder. However, metagenomic pipelines like kraken/kraken2 classify reads to a clade rather than to every strain or subspecies. Estimating the genome size of a clade is therefore a challenge, because there is no established rule for deriving a clade's genome length from the genome sizes of its subspecies or strains. So even though the fasta file of the kraken2 taxonomy contains genome size information, it is better to obtain the genome size of the clade itself. We have nevertheless provided a script to calculate the genome size of each species present in the fasta file.
Secondly, for classification algorithms like kaiju, no .fna or .fasta file is available. Only the index files used for alignment are provided, and the output is a species-wise table of assigned reads. In that case there is no way to obtain genome sizes from the database files at all.
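The fasta-based calculation mentioned above can be sketched in Python as follows. This is not the script we provide, only an illustration of the idea: sequence lengths are summed per taxid. It assumes that each header carries the taxid in the form "kraken:taxid|<taxid>|", as is typical for kraken2 library files; the example records and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: headers carry the taxid as "kraken:taxid|<taxid>|".
# Adjust the pattern if your fasta headers use another layout.
TAXID_PATTERN = re.compile(r"kraken:taxid\|(\d+)")


def genome_sizes_by_taxid(fasta_lines):
    """Sum sequence lengths (bp) per taxid from an iterable of FASTA lines."""
    sizes = defaultdict(int)
    current_taxid = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: remember its taxid, or skip it if none is found.
            match = TAXID_PATTERN.search(line)
            current_taxid = match.group(1) if match else None
        elif current_taxid is not None:
            sizes[current_taxid] += len(line)
    return dict(sizes)


# Hypothetical records, for illustration only.
example = [
    ">kraken:taxid|111|seq1 hypothetical chromosome",
    "ACGTACGT",
    "ACGT",
    ">kraken:taxid|111|seq2 hypothetical plasmid",
    "GGCC",
    ">kraken:taxid|222|seq3 hypothetical chromosome",
    "TTTTTT",
]
print(genome_sizes_by_taxid(example))  # {'111': 16, '222': 6}
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the example list. Note that this sums every sequence sharing a taxid, so a taxid with several assemblies in the file would be overcounted; that is one reason a clade-level size is hard to derive from the fasta file alone.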
We developed bash scripts that download genome size directly from the NCBI server, given a list of taxids. The only limitation is that taxa which have an NCBI taxid but no defined genome return no genome size from our scripts. For these cases we mine the Genome Check API tool from NCBI (https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=).
Some taxids still have no expected genome size in the NCBI Genome Check API. For these cases we used AI tools like perplexity.com to find genome sizes together with the sources (research papers or portals) that report them. Our curated database records the source from which each genome size was extracted.
Our curated dataset lists the bacterial species or strains, with their genome sizes, detected in 107 AML patients clinically diagnosed with sepsis. Cell-free DNA profiles of the patients were built and sequenced on Illumina platforms (NovaSeq and NextSeq). Bioinformatic analysis was performed using two classification algorithms, kraken2 and kaiju. For kraken2-based classification, the reference bacterial index developed by Carlo Ferravante et al (Zenodo 2020) (link: https://zenodo.org/records/4055180) was used, while for kaiju-based classification the reference database named "nr_euk" dated "2023-05-10" (link: https://bioinformatics-centre.github.io/kaiju/downloads.html) was used.
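Our scripts are written in bash, but the Genome Check lookup can also be sketched in Python, as below. This is a hedged illustration, not our implementation. The function names are hypothetical, and the parser assumes that the response contains an <expected_ungapped_length> element; please check the actual response for your taxid before relying on it.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size"

# Assumption: the response reports the size in this element.
LENGTH_PATTERN = re.compile(
    r"<expected_ungapped_length>(\d+)</expected_ungapped_length>"
)


def build_url(taxid):
    """Build the Genome Check query URL for one species taxid."""
    return BASE_URL + "?" + urllib.parse.urlencode({"species_taxid": taxid})


def parse_expected_size(response_text):
    """Return the expected genome size in bp, or None if it is absent."""
    match = LENGTH_PATTERN.search(response_text)
    return int(match.group(1)) if match else None


def fetch_expected_size(taxid, timeout=30):
    """Query the API for one taxid; return None when no size is available."""
    try:
        with urllib.request.urlopen(build_url(taxid), timeout=timeout) as response:
            return parse_expected_size(response.read().decode("utf-8"))
    except urllib.error.HTTPError:
        return None
```

Calling fetch_expected_size with each taxid from a list, and recording "NA" whenever it returns None, would mirror the fallback behaviour described above for taxids without an expected genome size.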
Kraken2 dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kraken_genome_annotation"
Kaiju dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kaiju_genome_annotation"
*Please note that for the kraken2 curated dataset we used data mining from the AI search engine perplexity.ai, while for kaiju we did not use perplexity.ai, and any species whose genome size was not found was labeled "NA".