Coverage discrepancy with inStrain profile #99

cswift3 · 2021-12-18T12:23:22Z

Hello,
I am using inStrain profile to calculate SNVs for long-read (nanopore) sequencing data. However, I am concerned that inStrain is discounting a large proportion of my input reads. The BAM file that I am using as input to inStrain profile has ~400x coverage. I found that after running inStrain profile, the coverage is between 100 and 200x in most positions. I have checked the ReadFiltering plot (attached to this post below) as well as the mapping_info.tsv and 91% (25,203 out of 27,747) of the reads are passing filter, so I am confused as to why the reported coverage is so much less.
The command I used is as follows:
inStrain profile -c 10 --pairing_filter all_reads --min_read_ani 0.92 /path/sorted.bam /path/reference.fasta -o /path/output/directory

Please advise ASAP as I am currently in the review process for manuscript. If I cannot get this resolved quickly, unfortunately I will have to use another tool for variant calling.

Thank you,
Candice

inStrain_rerun_barcode85_ReadFiltering_plot.pdf

MrOlm · 2021-12-18T19:22:58Z

Hi Candice,

Hmm interesting. A couple of questions-

What program to calculate coverage gave you the count of ~400x?
Could you attached the file mapping_info.tsv from the inStrain output file?

Thanks,
Matt

cswift3 · 2021-12-19T03:06:43Z

Hi Matt, Thanks so much for the quick response! I visualized the BAM file in Tablet and IGV. I also used another program that is part of the ARTIC field bioinformatics pipeline (specifically artic minion) to call the variants and it gave me ~400x at the position of interest (contrast to ~50x from instrain profile for this SNS). Exact values as follows: Tablet average coverage depth (for the entire BAM)=317.6x Tablet coverage at position 23,403=400x IGV coverage at position 23,403=347x ARTIC minion total reads at position 23,403=395x The disagreement between IGV and Tablet at this position is because reads with deletions/insertions are not included in the total reads. When these are added back in, IGV agrees with Tablet and ARTIC. So right now I have 3 different methods in agreement for the coverage and inStrain is the outlier. Attached is the mapping_info.tsv. I also tried attaching a zipped BAM and BAI file, but the file was too big to go through email. Thanks again, Candice

…

On Sat, Dec 18, 2021 at 2:23 PM Matt Olm ***@***.***> wrote: Hi Candice, Hmm interesting. A couple of questions- 1. What program to calculate coverage gave you the count of ~400x? 2. Could you attached the file mapping_info.tsv from the inStrain output file? Thanks, Matt — Reply to this email directly, view it on GitHub <#99 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AK6JF7GHIA5KQIPYDI4CFF3URTNR3ANCNFSM5KKS7V5Q> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you authored the thread.Message ID: ***@***.***>

MrOlm · 2021-12-20T19:49:31Z

Hi Candice,

Thanks for all of this- it certainly seems like inStrain is struggling with the long-reads. I'll note that I've never run it with long reads before, but I'd really like it to work with long reads, so I appreciate your help troubleshooting this.

The mapping_info.tsv file you attached didn't work; I believe you have to attach it on GitHub rather than email it. It would also be great if you could attach the SNVs.tsv file as well.

I'll look at the file once I get it, but in thinking through potential problems, do you happen to know if you have multiple reads with the same name? inStrain treats reads with the same name as being the "same", so that could mess with the count?

Best,
Matt

cswift3 · 2021-12-21T14:43:36Z

Hi Matt,
Thanks for looking into this. I really appreciate it. I am attaching the two files via GitHub. I had to change the file extensions to TXT, but they are the same as the original TSV output from inStrain.
The fastq headers are unique.

Thanks,
Candice

[inStrain_rerun_barcode85_mapping_info_tsv.txt](https://github.com/MrOlm/inStrain/files/7755872/in
inStrain_rerun_barcode85_SNVs_tsv.txt
Strain_rerun_barcode85_mapping_info_tsv.txt)
.

MrOlm · 2021-12-21T16:29:59Z

OK- I think I got to the root of the problem. Thanks for your help.

When calling SNVs, inStrain uses an empirically generated distribution of "expected" variants based on Q30 bases. So when calling SNVs (and calculating coverage, it seems), it only looks at reads with q30. I assume that because you have long reads, many don't reach this q30 threshold, and only the ones that do are "counted" in the SNV table.

Adjusting the minimum threshold to be something other than Q30 is trivial (it's line 154 in the file "profile_utilities.py"), but adjusting the SNV-calling assumptions and distributions is complicated.

This is something that I will fix, and I'll do it ASAP, but I'm not sure how long it will take. I'll of course ping you as soon as it's finished. But if you need an immediate solution, I'd say your options are to use another program, or to state that only reads with a quality sore of at least Q30 were used when calling SNVs.

Sorry again for the confusion and thanks for your help tracking this down and helping me improve inStrain!

-Matt

cswift3 · 2021-12-21T17:47:49Z

Hi Matt,
Great! I'm glad you got to the bottom of it! Definitely let me know when you have a fix and we'll see where I'm at in responding to other reviewer requests. I really appreciate your help.

Thanks again,
Candice

MrOlm added the enhancement New feature or request label Dec 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coverage discrepancy with inStrain profile #99

Coverage discrepancy with inStrain profile #99

cswift3 commented Dec 18, 2021

MrOlm commented Dec 18, 2021

cswift3 commented Dec 19, 2021 via email

MrOlm commented Dec 20, 2021

cswift3 commented Dec 21, 2021

MrOlm commented Dec 21, 2021

cswift3 commented Dec 21, 2021

Coverage discrepancy with inStrain profile #99

Coverage discrepancy with inStrain profile #99

Comments

cswift3 commented Dec 18, 2021

MrOlm commented Dec 18, 2021

cswift3 commented Dec 19, 2021 via email

MrOlm commented Dec 20, 2021

cswift3 commented Dec 21, 2021

MrOlm commented Dec 21, 2021

cswift3 commented Dec 21, 2021