Detailed calculation of breadth (and coverage) #185
Hi @anniewest - Your intuition on how breadth is calculated is correct - that's exactly how it's done. The difference you see between inStrain and pileup.sh likely comes from inStrain's read-level filtering: by default, profile discards reads below 95% ANI to the reference before coverage is calculated, which pileup.sh does not do.

I also believe that the way inStrain reports coverage is the most "traditional"; it's just the average number of reads mapped to a region. inStrain's coverage = coverM's "mean", and inStrain's "breadth" is coverM's "covered_fraction". https://github.com/wwood/CoverM

Let me know if you still have questions!

Matt
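The two metrics described above can be sketched directly from a per-base depth array. This is an illustrative sketch of the definitions (mean depth and covered fraction), not inStrain's actual implementation; the function name `scaffold_metrics` is hypothetical.

```python
def scaffold_metrics(depth):
    """Return (coverage, breadth) for a list of per-base read depths.

    coverage = mean number of reads per position (coverM's "mean")
    breadth  = fraction of positions with depth > 0 (coverM's "covered_fraction")
    """
    length = len(depth)
    coverage = sum(depth) / length
    breadth = sum(1 for d in depth if d > 0) / length
    return coverage, breadth

# Example: a 10 bp scaffold with reads mapped to 6 of its positions
cov, br = scaffold_metrics([0, 0, 3, 5, 5, 4, 2, 1, 0, 0])
# cov = 2.0, br = 0.6
```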
Thanks Matt! I did actually have a follow-up question, related to CheckM: I've run inStrain on a dereplicated set of genomes derived from my metagenome samples. The outputs suggest little-to-no strain diversity for those genomes (popANI 0.9999, breadth > 0.5) among samples, despite a few of the genomes having fairly large strain-heterogeneity estimates from CheckM. I'd be interested to hear your thoughts on the matter if you have a spare moment.

I have some evidence to believe that, for one of the dereplicated genomes, the other bins in that dRep secondary cluster were much larger than expected for that species and may be assembling reads from two closely related species (based on phylogenetic placement against GTDB representative genomes for the genus), resulting in high CheckM strain-heterogeneity estimates at reasonable contamination levels (e.g. the winning dRep genome had 12.6% contamination, of which 75% was estimated strain heterogeneity, while the other bins had 70-82% contamination, of which >95% was estimated strain heterogeneity).

Many thanks :)

Annie
Hi Annie,

Yes - your explanation makes total sense to me. It seems like your bins are impacted by having multiple related strains, but that the intra-population diversity is low. Rather than within-sample popANI, you might be able to see this with metrics like nucleotide diversity and/or SNVs per kb. All in all, intra-population diversity and the strain heterogeneity of bins are pretty different things, so it's not surprising they show different trends.

Best,

Matt
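The two intra-population metrics Matt mentions can be sketched as follows. This is a hedged illustration using one common per-site definition of nucleotide diversity (1 − Σ f², summed over the A/C/G/T allele frequencies); inStrain's exact formula may differ (e.g. sample-size corrections), and both function names here are hypothetical.

```python
def site_diversity(counts):
    """Per-site nucleotide diversity from a dict of base -> read count.

    Uses the common definition 1 - sum(f_i^2) over allele frequencies;
    returns 0.0 for a monomorphic or uncovered site.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def snvs_per_kb(n_snvs, genome_length_bp):
    """Genome-wide SNV rate expressed per kilobase."""
    return 1000.0 * n_snvs / genome_length_bp

pi = site_diversity({"A": 90, "G": 10})   # ~0.18 at a 90/10 biallelic site
rate = snvs_per_kb(150, 3_000_000)        # 150 SNVs on a 3 Mb genome
```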
Hi Matt,
Thanks for inStrain - it's a great piece of software! I'm no bioinformatician, but I've been trying to wrap my head around how the profile pipeline calculates breadth at the scaffold and genome level. After trawling through the underlying scripts, I have to say I'm no closer to an answer, so I thought I'd open an issue to ask!
My understanding of the breadth calculation is that you count all the bases of a given reference genome (or scaffold) that have at least one read mapped (i.e. depth > 0), then divide that number by the genome (or scaffold) length. Is this correct? Is there a fundamental step I'm missing where this is only based on scaffolds that have a minimum coverage? I've tried looking through all the issues on this GitHub page, but to no avail (I'm aware that CoverM is used in the quick_profile function, but I'm not sure how that differs from what happens in the profile function).

To compare the output, I also ran BBMap's pileup.sh basecov function on the same BAM files used for inStrain (metagenomic reads mapped to my set of reference genomes with bowtie2 --sensitive), and applied the formula described above to calculate breadth. These outputs differed (hence my question here), so I'm wondering where the differences between the two methods lie. My gut instinct is that I'm missing something really obvious, but I'm too deep in the woods to figure it out.
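The breadth formula described above can be applied to a pileup.sh basecov file like this. A minimal sketch, assuming the typical three-column tab-separated layout (`#RefName`, `Pos`, `Coverage`, one row per base); check your file's header before relying on it, and note the function name is illustrative.

```python
from collections import defaultdict

def breadth_from_basecov(path):
    """Breadth per reference: positions with depth > 0 / positions listed."""
    covered = defaultdict(int)   # reference -> positions with coverage > 0
    length = defaultdict(int)    # reference -> total positions listed
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip the header row
            ref, _pos, cov = line.rstrip("\n").split("\t")
            length[ref] += 1
            if int(cov) > 0:
                covered[ref] += 1
    return {ref: covered[ref] / length[ref] for ref in length}
```

Comparing this number against inStrain's breadth won't match exactly, since inStrain filters reads before counting while pileup.sh reports raw depth.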
On another note, I was curious why you report coverage as the average number of reads mapped per scaffold, as opposed to traditional methods that normalise by gene length (and often apply a cut-off, e.g. >10 reads mapped)?
Cheers!