
Account for overlapping reads in Theoretical Sensitivity model #1802

Open
wants to merge 7 commits into master

Conversation

fleharty (Contributor)

Description

This modifies the Theoretical Sensitivity model to account for overlapping reads. Overlapping reads are particularly important at low allele fractions: when the two reads of a pair overlap, their bases should agree, which reduces the effective sequencer error rate and yields much higher effective base qualities.
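To illustrate the idea (a minimal sketch, not code from this PR; the helper name and the Phred-space cap are assumptions consistent with the diff discussed below): when the two reads of a pair overlap and report the same base, their error probabilities roughly multiply, so their Phred qualities add, and the combined quality is capped by the PCR error rate because a PCR error is shared by both reads of the fragment.

```java
// Illustration only (not the PR's code): combining the Phred qualities of the
// two reads of an overlapping pair. When both reads report the same base their
// error probabilities multiply, so the Phred scores add; the result is capped
// at the PCR error rate (in Phred space) because a PCR error is present in
// both reads of the fragment and cannot be averaged away.
static int combinedOverlapQuality(final int quality1, final int quality2, final int pcrErrorPhred) {
    return Math.min(quality1 + quality2, pcrErrorPhred);
}
```

For example, two q30 bases that agree behave like roughly one q60 base, unless a PCR error rate of, say, q45 caps the combined quality at 45.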


Checklist (never delete this)

Never delete this; it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked, but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@fleharty (Contributor Author)

@kachulis Would you have a moment to take a look at this?

@davidbenjamin This might be of interest to you since you implemented the original TheoreticalHetSensitivity model.

@kachulis self-requested a review on April 27, 2022 at 14:37
src/main/java/picard/analysis/TheoreticalSensitivity.java (outdated)
log.info("Calculating theoretical sensitivity at " + ALLELE_FRACTION.size() + " allele fractions.");

List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE,
collector.getUnfilteredDepthHistogram(), collector.getUnfilteredBaseQHistogram(), ALLELE_FRACTION, collector.basesExcludedByOverlap, PCR_ERROR_RATE);
Contributor:

collector.basesExcludedByOverlap needs to be divided by the total number of bases to get the fraction of overlapping bases.

Contributor Author:

Thanks for catching this. I think this is fixed by using collector.getMetrics(dupeFilter, adapterFilter, mapqFilter, pairFilter).PCT_EXC_OVERLAP instead.
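
As a side note, the distinction here is just count versus fraction; a hypothetical helper (names are illustrative, not from the PR) makes the normalization explicit:

```java
// Hypothetical helper, for illustration only: a count of bases excluded by
// overlap only becomes an overlap probability once it is normalized by the
// total number of bases considered.
static double overlapFraction(final long basesExcludedByOverlap, final long totalBases) {
    return totalBases == 0 ? 0.0 : (double) basesExcludedByOverlap / totalBases;
}
```

Using PCT_EXC_OVERLAP, as in the fix above, sidesteps this by taking the already-normalized fraction from the metrics.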

src/main/java/picard/analysis/TheoreticalSensitivity.java (outdated)
// of qualities cannot exceed the PCR error rate.
sumOfQualities += Math.min(qualityRW.draw() + qualityRW.draw(), pcrErrorRate);
}
}
} else {
Contributor:

Don't you need to adjust the large number approx code as well? Particularly since you are introducing a PCR base quality cap.

@fleharty (Contributor Author) commented Jun 3, 2022:

Yes. I don't like the large-number approximation code: it adds complexity and doesn't speed things up enough to justify keeping it.

So I'm removing the approximation code.
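
For readers following along, here is a rough sketch of the sampling path that remains once the approximation is gone. This is a paraphrase under assumptions, not the PR's code: the class and field names are invented, and only the Math.min pairing of overlapping qualities is taken from the diff above.

```java
import java.util.Random;

// Sketch of the Monte Carlo path: each alt read contributes a base quality
// drawn from the empirical quality distribution; overlapping pairs contribute
// the sum of two draws, capped at the PCR error rate in Phred space.
final class QualitySumSampler {
    private final int[] empiricalQualities; // stand-in for the histogram-backed sampler in the PR
    private final Random random = new Random(42);

    QualitySumSampler(final int[] empiricalQualities) {
        this.empiricalQualities = empiricalQualities;
    }

    private int draw() {
        return empiricalQualities[random.nextInt(empiricalQualities.length)];
    }

    // Sum of qualities for altDepth supporting reads, overlappingPairs of which overlap their mate.
    int sampleSumOfQualities(final int altDepth, final int overlappingPairs, final int pcrErrorPhred) {
        int sum = 0;
        for (int i = 0; i < overlappingPairs; i++) {
            sum += Math.min(draw() + draw(), pcrErrorPhred); // overlapping pair: qualities add, capped by PCR
        }
        for (int i = 0; i < altDepth - overlappingPairs; i++) {
            sum += draw(); // non-overlapping read: quality used as-is
        }
        return sum;
    }
}
```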

src/main/java/picard/analysis/CollectWgsMetrics.java (outdated)
src/test/java/picard/IntelInflaterDeflaterLoadTest.java (outdated)
@@ -319,8 +337,17 @@ private static double sensitivityAtConstantDepth(final int depth, final Histogra
* @param alleleFraction the allele fraction to evaluate sensitivity at
* @return Theoretical sensitivity for the given arguments over a particular depth distribution.
*/
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram, final Histogram<Integer> qualityHistogram,
final int sampleSize, final double logOddsThreshold, final double alleleFraction) {
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram,
Contributor:

Do we need to keep this method overload, given that it is only used in tests? I get that it'd be an API change, but I'm not sure how we feel about that. You're changing the API anyway with respect to sensitivityAtConstantDepth, so it's probably best to be consistent.

};
}

@Test(dataProvider = "TheoreticalSensitivityDataProvider")
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize) throws Exception {
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize, final boolean useOverlapProbability) throws Exception {
Contributor:

Add some tests that explore a range of different overlap probabilities and PCR error rates.
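
A data provider along these lines could cover that grid (a hedged sketch only; the parameter set and expected values are placeholders, not taken from the PR):

```java
import org.testng.annotations.DataProvider;

// Sketch of a TestNG data provider sweeping overlap probabilities and PCR error
// rates; the expected sensitivities are deliberately left as NaN placeholders to
// be filled in once the model's output for each combination has been checked.
public class TheoreticalSensitivitySweepSketch {
    @DataProvider(name = "overlapAndPcrSweep")
    public Object[][] overlapAndPcrSweep() {
        return new Object[][]{
                // alleleFraction, overlapProbability, pcrErrorRate (Phred), expectedSensitivity (placeholder)
                {0.1, 0.00, 45, Double.NaN},
                {0.1, 0.25, 45, Double.NaN},
                {0.1, 0.50, 45, Double.NaN},
                {0.1, 0.50, 30, Double.NaN},
                {0.3, 0.25, 45, Double.NaN},
                {0.3, 0.50, 30, Double.NaN},
        };
    }
}
```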

// of qualities cannot exceed the PCR error rate.
sumOfQualities += Math.min(qualityRW.draw() + qualityRW.draw(), pcrErrorRate);
}
// If the number of alt reads is "small" we draw from the actual base quality distribution.
Contributor:

Since it looks like you're getting rid of the Gaussian approximation at larger depths, this comment is now moot.

}
} else {
Contributor:

Have you profiled this change, in particular when we care about extremely deep data?

Contributor Author:

The tests run a bit long now; I'm trying to resolve the issue.

@fleharty (Contributor Author) commented Jun 23, 2022

@davidbenjamin @kachulis

One thing that bothers me about this model is that it has non-monotonic behavior: there are cases where increasing the depth causes a loss of sensitivity. I think this is really counter-intuitive and probably wrong, so it makes me wonder if there is a problem with the model.

As an example:
logOddsThreshold = 10
alleleFraction = 0.3
baseQualities = 30 (not a histogram, just all base qualities are q30)

For uniform depth of 20, the sensitivity is 0.76
For uniform depth of 21, the sensitivity is 0.64

This is a really big difference; I don't expect the theoretical sensitivity to drop by more than 10 percentage points simply because the depth increases from 20 to 21.

I don't think this is a bug in the code; I double-checked it in R and got the same result. The reason seems to be that the threshold for calling a variant increases when the depth increases from 20 to 21.

Any thoughts on this?

$\sum_{k=3}^{n} \binom{n}{k} p^k (1 - p)^{n - k}$
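
For what it's worth, the reported drop is consistent with the minimum number of supporting reads required by the log-odds threshold jumping by one between depth 20 and depth 21. A quick exact-binomial check of the tail above (a sketch using commons-math3; the 5-versus-6 minimum alt-read counts are my assumption about where the threshold lands, not values read out of the model):

```java
import org.apache.commons.math3.distribution.BinomialDistribution;

// Sketch: exact binomial tail P(X >= kMin) for X ~ Binomial(depth, alleleFraction).
// With alleleFraction = 0.3, requiring 5 alt reads at depth 20 gives ~0.76 and
// requiring 6 alt reads at depth 21 gives ~0.64, consistent with the numbers
// reported above. The kMin values are an assumption about where the log-odds
// threshold lands, not something taken from the model's code.
public final class BinomialTailCheck {
    static double tail(final int depth, final double alleleFraction, final int kMin) {
        return 1.0 - new BinomialDistribution(depth, alleleFraction).cumulativeProbability(kMin - 1);
    }

    public static void main(final String[] args) {
        System.out.println(tail(20, 0.3, 5)); // ~0.76
        System.out.println(tail(21, 0.3, 6)); // ~0.64
    }
}
```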
