Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference in total counts of each probability per modification in different pore chemistries #269

Open
eesiribloom opened this issue Sep 26, 2024 · 5 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@eesiribloom
Copy link

I have samples basecalled with dorado software (v0.4.1) with detection of 5mCG_5hmCG modifications enabled using either:

[email protected]
[email protected]

I've noticed the two "batches" have distinct distribution of probabilities in base modifications. In particular the R9.4.1 samples have a massive peak of C:m at the far right of the histogram which looks like some sort of artefact (second plot).

I'm wondering what the explanation for this would be and what is the best way to mitigate this issue?

command:

modkit sample-probs \
${input_bam} \
--log-filepath ${log} \
--percentiles 0.1,0.25,0.5,0.75,0.9 \
--out-dir ${output} \
--hist \
--prefix ${prefix} \
--suppress-progress \
--force
done

Counts (1)
Counts

@ArtRand
Copy link
Contributor

ArtRand commented Sep 26, 2024

Hello @eesiribloom,

Just a sanity check, are these plots generated from similar samples (and similar alignments). What are you looking for in terms of a mitigation strategy?

The R10 models have a heavier tail as a result of more diverse examples used in training. In general, these models will have higher performance in more contexts. You should still be able to use the (default) estimated threshold calculated by modkit. If you want to use modification-specific thresholds, you should use a different value per-mod, per-chemistry. We generally don't recommend trying to use the same thresholds across model versions or chemistries.

Do you get similar counts when you run modkit summary?

@HalfPhoton
Copy link

Duplicated: nanoporetech/dorado#1041

@eesiribloom
Copy link
Author

Apologies for the duplication as I wasn't sure where the best place to get an answer was.

The plots are generated from two samples of the same cancer type and cohort, aligned with the same version and parameters of minimap2 to GRCh38. Only difference - not to minimise or underestimate this - is the pore chemistries and subsequent basecalling models.

I suppose if I knew the large peak was some sort of artefact - it does seem strange to have this huge C:m count when the R10.1 samples don't - I might find a way to filter it out.

Obviously I am willing to accept some differences in output based on the different basecalling models and pore chemistries but I'd be concerned about taking this big peak forward if it might affect downstream analyses, especially as it is (perhaps oddly) at the high threshold so not easily removed that way.

Even just understanding the potential reason for the difference would be super helpful.

@eesiribloom
Copy link
Author

I've tried running modkit summary on both samples
For the R9.4.1 sample

> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.640625
# bases             C 
# total_reads_used  10042 
# count_reads_C     10042 
# pass_threshold_C  0.640625 
 base  code  pass_count  pass_frac    all_count  all_frac 
 C     h     17673       0.017328935  31400      0.027721277 
 C     -     265376      0.26020953   307517     0.2714893 
 C     m     736806      0.7224615    793787     0.70078945 

For the R10.1 sample

> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.7519531
# bases             C 
# total_reads_used  10042 
# count_reads_C     10042 
# pass_threshold_C  0.7519531 
 base  code  pass_count  pass_frac   all_count  all_frac 
 C     h     30896       0.04399692  54607      0.07007546 
 C     m     392828      0.55939996  434672     0.557801 
 C     -     278507      0.3966031   289981     0.37212354 

Curious to know what you might recommend in this situation - maybe just sticking with default is best? But am I right in thinking the default estimated threshold is not specific to different modifications?

One concern is that a --filter-percentile 0.1 seems quite permissive (e.g. threshold of 0.64 above isn't very high). Would you recommend upping this perhaps?

That being said, what happens at sites which are hemimethylated e.g. only on one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...

@eesiribloom
Copy link
Author

Just to follow up. I have 13 samples: 7 samples sequenced with R9.4.1 and 6 samples sequenced with R10.1
Using modkit summary and looking at the pass_count for code m:
R9.4.1 samples range from 0.63-0.73 (mean 0.67)
R10.1 samples range from 0.52-0.69 (mean 0.60)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

3 participants