Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coverage discrepancy with inStrain profile #99

Open
cswift3 opened this issue Dec 18, 2021 · 6 comments
Open

Coverage discrepancy with inStrain profile #99

cswift3 opened this issue Dec 18, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@cswift3
Copy link

cswift3 commented Dec 18, 2021

Hello,
I am using inStrain profile to calculate SNVs for long-read (nanopore) sequencing data. However, I am concerned that inStrain is discounting a large proportion of my input reads. The BAM file that I am using as input to inStrain profile has ~400x coverage. I found that after running inStrain profile, the coverage is between 100 and 200x in most positions. I have checked the ReadFiltering plot (attached to this post below) as well as the mapping_info.tsv and 91% (25,203 out of 27,747) of the reads are passing filter, so I am confused as to why the reported coverage is so much less.
The command I used is as follows:
inStrain profile -c 10 --pairing_filter all_reads --min_read_ani 0.92 /path/sorted.bam /path/reference.fasta -o /path/output/directory

Please advise ASAP as I am currently in the review process for manuscript. If I cannot get this resolved quickly, unfortunately I will have to use another tool for variant calling.

Thank you,
Candice

inStrain_rerun_barcode85_ReadFiltering_plot.pdf

@MrOlm
Copy link
Owner

MrOlm commented Dec 18, 2021

Hi Candice,

Hmm interesting. A couple of questions-

  1. What program to calculate coverage gave you the count of ~400x?

  2. Could you attached the file mapping_info.tsv from the inStrain output file?

Thanks,
Matt

@cswift3
Copy link
Author

cswift3 commented Dec 19, 2021 via email

@MrOlm
Copy link
Owner

MrOlm commented Dec 20, 2021

Hi Candice,

Thanks for all of this- it certainly seems like inStrain is struggling with the long-reads. I'll note that I've never run it with long reads before, but I'd really like it to work with long reads, so I appreciate your help troubleshooting this.

The mapping_info.tsv file you attached didn't work; I believe you have to attach it on GitHub rather than email it. It would also be great if you could attach the SNVs.tsv file as well.

I'll look at the file once I get it, but in thinking through potential problems, do you happen to know if you have multiple reads with the same name? inStrain treats reads with the same name as being the "same", so that could mess with the count?

Best,
Matt

@cswift3
Copy link
Author

cswift3 commented Dec 21, 2021

Hi Matt,
Thanks for looking into this. I really appreciate it. I am attaching the two files via GitHub. I had to change the file extensions to TXT, but they are the same as the original TSV output from inStrain.
The fastq headers are unique.

Thanks,
Candice

[inStrain_rerun_barcode85_mapping_info_tsv.txt](https://github.com/MrOlm/inStrain/files/7755872/in
inStrain_rerun_barcode85_SNVs_tsv.txt
Strain_rerun_barcode85_mapping_info_tsv.txt)
.

@MrOlm
Copy link
Owner

MrOlm commented Dec 21, 2021

OK- I think I got to the root of the problem. Thanks for your help.

When calling SNVs, inStrain uses an empirically generated distribution of "expected" variants based on Q30 bases. So when calling SNVs (and calculating coverage, it seems), it only looks at reads with q30. I assume that because you have long reads, many don't reach this q30 threshold, and only the ones that do are "counted" in the SNV table.

Adjusting the minimum threshold to be something other than Q30 is trivial (it's line 154 in the file "profile_utilities.py"), but adjusting the SNV-calling assumptions and distributions is complicated.

This is something that I will fix, and I'll do it ASAP, but I'm not sure how long it will take. I'll of course ping you as soon as it's finished. But if you need an immediate solution, I'd say your options are to use another program, or to state that only reads with a quality sore of at least Q30 were used when calling SNVs.

Sorry again for the confusion and thanks for your help tracking this down and helping me improve inStrain!

-Matt

@MrOlm MrOlm added the enhancement New feature or request label Dec 21, 2021
@cswift3
Copy link
Author

cswift3 commented Dec 21, 2021

Hi Matt,
Great! I'm glad you got to the bottom of it! Definitely let me know when you have a fix and we'll see where I'm at in responding to other reviewer requests. I really appreciate your help.

Thanks again,
Candice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants