Difference in total counts of each probability per modification in different pore chemistries #269
Comments
Hello @eesiribloom, just a sanity check: are these plots generated from similar samples (and similar alignments)? What are you looking for in terms of a mitigation strategy? The R10 models have a heavier tail as a result of more diverse examples used in training. In general, these models will have higher performance in more contexts. You should still be able to use the (default) estimated threshold calculated by Do you get similar counts when you run
Duplicated: nanoporetech/dorado#1041
Apologies for the duplication; I wasn't sure where the best place to get an answer was. The plots are generated from two samples of the same cancer type and cohort, aligned to GRCh38 with the same version and parameters of minimap2. The only difference (not to minimise or underestimate it) is the pore chemistry and the corresponding basecalling model. If I knew the large peak was some sort of artefact, I might find a way to filter it out; it does seem strange to have this huge C:m count when the R10.1 samples don't. I am willing to accept some differences in output based on the different basecalling models and pore chemistries, but I'd be concerned about taking this big peak forward if it might affect downstream analyses, especially as it sits (perhaps oddly) at the high-probability end and so cannot easily be removed by thresholding. Even just understanding the potential reason for the difference would be super helpful.
I've tried running
For the R10.1 sample
Curious to know what you might recommend in this situation. Maybe just sticking with the default is best? Am I right in thinking the default estimated threshold is not specific to each modification? One concern is that a --filter-percentile of 0.1 seems quite permissive (e.g. the threshold of 0.64 above isn't very high). Would you recommend raising it? That said, what happens at sites which are hemimethylated, e.g. methylated on only one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability, and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...
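To make the percentile question concrete, here is a minimal sketch of how a percentile-based pass threshold behaves. It assumes that the confidence of a call is its largest class probability (canonical included) and that the threshold is the value at the chosen percentile of those confidences. The function names are hypothetical and this is not modkit's actual implementation, which may sample reads and interpolate differently.

```python
import math


def call_confidence(mod_probs):
    """Confidence of one call: the largest class probability.

    mod_probs maps a modification code to its probability, e.g.
    {"m": 0.7, "h": 0.1}; the canonical probability is the remainder.
    (Assumption: confidence is defined as the maximum over all classes.)
    """
    canonical = 1.0 - sum(mod_probs.values())
    return max([canonical, *mod_probs.values()])


def estimate_threshold(confidences, filter_percentile=0.1):
    """Nearest-rank percentile of the confidences (illustrative only)."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    ordered = sorted(confidences)
    # Nearest-rank: smallest value with at least this fraction at or below it.
    rank = max(1, math.ceil(filter_percentile * len(ordered)))
    return ordered[rank - 1]


def passing_calls(calls, threshold):
    """Keep only calls whose confidence reaches the threshold."""
    return [c for c in calls if call_confidence(c) >= threshold]
```

Under these assumptions, raising the percentile discards a larger share of the least confident calls. Note that the threshold applies to the winning class, so a call with roughly 0.5 probability for one modification is only kept if some other class is confident enough.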
Just to follow up. I have 13 samples: 7 sequenced with R9.4.1 and 6 sequenced with R10.1.
The samples were basecalled with dorado (v0.4.1) with detection of 5mCG_5hmCG modifications enabled, using either:
[email protected]
[email protected]
I've noticed that the two "batches" have distinct distributions of base modification probabilities. In particular, the R9.4.1 samples have a massive peak of C:m at the far right of the histogram, which looks like some sort of artefact (second plot).
I'm wondering what the explanation for this would be and what is the best way to mitigate this issue?
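For comparing the two batches independently of read depth, one option is to bin the raw ML tag values per modification code and compare fractions instead of counts. The sketch below assumes the calls have already been extracted as (modification code, ML byte) pairs; the function names are hypothetical. The byte-to-probability mapping follows the SAM tags specification, where ML value N covers the range N/256 to (N+1)/256.

```python
from collections import Counter, defaultdict


def ml_to_probability(ml_byte):
    """Midpoint of the probability range encoded by an ML byte (0..255)."""
    if not 0 <= ml_byte <= 255:
        raise ValueError("ML values must be in 0..255")
    return (ml_byte + 0.5) / 256


def probability_histogram(calls, n_bins=16):
    """Count calls per probability bin, separately for each mod code.

    calls is an iterable of (mod_code, ml_byte) pairs, e.g. ("m", 250).
    Integer arithmetic keeps bin edges exact.
    """
    hist = defaultdict(Counter)
    for code, ml_byte in calls:
        hist[code][ml_byte * n_bins // 256] += 1
    return hist


def bin_fractions(counter):
    """Convert bin counts to fractions so samples of different depth compare."""
    total = sum(counter.values())
    return {b: n / total for b, n in sorted(counter.items())}
```

If the top bin for C:m holds a much larger fraction in the R9.4.1 samples than in the R10.1 samples after this normalisation, the difference is not explained by depth alone.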
command: