-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Account for overlapping reads in Theoretical Sensitivity model #1802
base: master
Are you sure you want to change the base?
Conversation
@kachulis Would you have a moment to take a look at this? @davidbenjamin This might be of interest to you since you implemented the original TheoreticalHetSensitivity model. |
log.info("Calculating theoretical sensitivity at " + ALLELE_FRACTION.size() + " allele fractions."); | ||
|
||
List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE, | ||
collector.getUnfilteredDepthHistogram(), collector.getUnfilteredBaseQHistogram(), ALLELE_FRACTION, collector.basesExcludedByOverlap, PCR_ERROR_RATE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collector.basesExcludedByOverlap
needs to be divided by total bases to get fraction overlapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. I think this is fixed by using collector.getMetrics(dupeFilter, adapterFilter, mapqFilter, pairFilter).PCT_EXC_OVERLAP
instead.
// of qualities cannot exceed the PCR error rate. | ||
sumOfQualities += Math.min(qualityRW.draw() + qualityRW.draw(), pcrErrorRate); | ||
} | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to adjust the large number approx code as well? Particularly since you are introducing a PCR base quality cap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't like the large number approx code. It adds complexity to the code, and really doesn't speed it up enough to justify it.
So I'm removing the approximation code.
@@ -319,8 +337,17 @@ private static double sensitivityAtConstantDepth(final int depth, final Histogra | |||
* @param alleleFraction the allele fraction to evaluate sensitivity at | |||
* @return Theoretical sensitivity for the given arguments over a particular depth distribution. | |||
*/ | |||
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram, final Histogram<Integer> qualityHistogram, | |||
final int sampleSize, final double logOddsThreshold, final double alleleFraction) { | |||
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to keep this method overload, given that it is only used in tests? I get that it'd be an API change, but not sure how we feel about that. You're changing the API anyway with respect to sensitivityAtConstantDepth
, so probably best to be consistent.
}; | ||
} | ||
|
||
@Test(dataProvider = "TheoreticalSensitivityDataProvider") | ||
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize) throws Exception { | ||
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize, final boolean useOverlapProbability) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some tests that explore a range of different overlap probabilities and pcr error rates
// of qualities cannot exceed the PCR error rate. | ||
sumOfQualities += Math.min(qualityRW.draw() + qualityRW.draw(), pcrErrorRate); | ||
} | ||
// If the number of alt reads is "small" we draw from the actual base quality distribution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it looks like you're getting rid of the Gaussian approximation at larger depth it seems that this comment is now moot.
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you profiled this change, in particular when we care about extremely deep data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are a bit long now, I'm trying to resolve the issue.
One thing that bothers me about this model is that it has non-monotonic behavior. As an example: For uniform depth of 20, the sensitivity is 0.76 This is a really big difference, and I really don't expect the theoretical sensitivity to drop by 10% simply by increasing the depth from 20 to 21. I don't think this is a bug in the code, I double checked this in R, and got the same result. The reason seems to be because the threshold for calling a variant increases when the depth increases from 20 to 21. Any thoughts on this? |
…e problem with the model.
Description
This modifies the Theoretical Sensitivity model to account for overlapping reads. Overlapping reads are particularly important at low allele fractions because overlaps allow for the reduction of sequencer error because overlapping bases should be identical, and generally have much higher effective base qualities.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests