-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: speed improvements for primer pair hit building from single primer hits #99
Conversation
… hit reference name and left/right primer orientation
WalkthroughThe pull request modifies the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
tests/offtarget/test_offtarget.py (1)
240-284
: Update parameter names for clarityChanging parameter names from
left
andright
topositive
andnegative
improves clarity regarding strand orientation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
prymer/offtarget/offtarget_detector.py
(4 hunks)tests/offtarget/test_offtarget.py
(2 hunks)
🔇 Additional comments (2)
prymer/offtarget/offtarget_detector.py (1)
497-500
: Validate hit strands before processing
The checks ensure positive hits are on the positive strand and negative hits on the negative strand, which is good. No issues here.
tests/offtarget/test_offtarget.py (1)
175-235
: Well-structured test for _build_off_target_result
The new test test_build_off_target_result
is comprehensive and correctly verifies the function's behavior with various hits.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #99 +/- ##
==========================================
+ Coverage 96.66% 96.73% +0.07%
==========================================
Files 26 26
Lines 1708 1746 +38
Branches 333 349 +16
==========================================
+ Hits 1651 1689 +38
Misses 31 31
Partials 26 26 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
prymer/offtarget/offtarget_detector.py (1)
477-506
: Enhance error messages with hit details.Error messages could be more specific by including which hits are invalid.
Apply this diff:
- raise ValueError("Positive hits must be on the positive strand.") + invalid_hits = [h for h in positive_hits if h.negative] + raise ValueError(f"Found {len(invalid_hits)} positive hits on negative strand: {invalid_hits}") - raise ValueError("Negative hits must be on the negative strand.") + invalid_hits = [h for h in negative_hits if not h.negative] + raise ValueError(f"Found {len(invalid_hits)} negative hits on positive strand: {invalid_hits}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
prymer/offtarget/offtarget_detector.py
(4 hunks)
🔇 Additional comments (2)
prymer/offtarget/offtarget_detector.py (2)
130-145
: LGTM! Well-structured dataclass.
Clean implementation with appropriate field defaults and documentation.
359-365
: Add early return after exceeding primer hit threshold.
Currently, when primer hits exceed the maximum allowed, the function sets passes=False
but continues processing. To avoid unnecessary computation, return the result immediately after setting passes=False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
prymer/offtarget/offtarget_detector.py (1)
479-522
: Consider pre-filtering hits before product operationCurrent implementation is good, but could be optimized further.
Consider filtering hits by position before using itertools.product to reduce unnecessary combinations:
- for positive_hit, negative_hit in itertools.product(positive_hits, negative_hits): + # Pre-filter hits that could never form valid amplicons + filtered_negative_hits = [h for h in negative_hits if h.start > min(p.end for p in positive_hits)] + filtered_positive_hits = [h for h in positive_hits if h.end < max(n.start for n in negative_hits)] + for positive_hit, negative_hit in itertools.product(filtered_positive_hits, filtered_negative_hits):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
prymer/offtarget/offtarget_detector.py
(4 hunks)tests/offtarget/test_offtarget.py
(2 hunks)
🔇 Additional comments (5)
tests/offtarget/test_offtarget.py (3)
175-234
: LGTM: Comprehensive test coverage for off-target result building
Test effectively validates hit organization by reference and strand orientation.
240-268
: LGTM: Test cases cover critical scenarios
Test matrix effectively validates primer orientation and amplicon size constraints.
288-326
: LGTM: Error cases well covered
Comprehensive validation of error conditions for invalid primer configurations.
prymer/offtarget/offtarget_detector.py (2)
130-145
: LGTM: Well-structured data organization
Clean dataclass implementation for hit organization.
356-423
: LGTM: Improved efficiency with early exit and structured hit organization
Clean implementation with proper hit organization and early termination.
if ( | ||
left_bwa_result.hit_count > self._max_primer_hits | ||
or right_bwa_result.hit_count > self._max_primer_hits | ||
): | ||
result = OffTargetResult(primer_pair=primer_pair, passes=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just return here? Is there any value to caching the result in this case, where it's so little work to compute the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching is required by the unit test at
prymer/tests/offtarget/test_offtarget.py
Line 99 in 2656adf
# Test that using the cache (or not) does not affect the results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Perhaps I'll "fix" this in a separate PR.
[self._hit_to_span(h) for h in p2.hits] if self._keep_primer_spans else [] | ||
), | ||
if self._cache_results: | ||
self._primer_pair_cache[primer_pair] = replace(result, cached=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If keeping this (and again on line 408), why not just set cached=self._cache_results
at construction time, so there's no need to replace(cached=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic is that the returned object should have cached=False
, because it's the first time it's been generated, and the cached object has cached=True
for when it's retrieved from the cache. This jives with the description of cached
at
prymer/prymer/offtarget/offtarget_detector.py
Line 111 in 2656adf
cached: True if this result is part of a cache, False otherwise. This is useful for testing |
cached: True if this result is part of a cache, False otherwise. This is useful for testing
I suggest we come back to the cache issue on a subsequent PR.
# Get the set of reference names with hits | ||
hits_by_refname: dict[str, PrimerPairBwaHitsBySideAndStrand] = { | ||
hit.refname: PrimerPairBwaHitsBySideAndStrand() | ||
for hit in left_bwa_result.hits + right_bwa_result.hits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These feels pretty inefficient .. it will concatenate two fairly large lists and (I think?) generate a PrimerPairBwaHitsBySideAndStrand
per hit, then unique them, rather than generate one per reference contig.
I think something like this would work:
for refname in set(hit.refname for hit in itertools.chain(left_bwa_result.hits, right_bwa_result.hits)
for hit in left_bwa_result.hits: | ||
if hit.negative: | ||
hits_by_refname[hit.refname].left_negative.append(hit) | ||
else: | ||
hits_by_refname[hit.refname].left_positive.append(hit) | ||
|
||
for hit in right_bwa_result.hits: | ||
if hit.negative: | ||
hits_by_refname[hit.refname].right_negative.append(hit) | ||
else: | ||
hits_by_refname[hit.refname].right_positive.append(hit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see this ... I think actually maybe you don't care any more about left vs. right here and beyond this point? Sorry, as I think I may have sent you down this path.
Specifically if a single left primer happens to bind to the genome in two places, close together, in the right relative orientations, that could still cause problems. So I think it's really just one collection of + strand mappings and one collection of - strand mappings.
positive_hits=hits.left_positive, | ||
negative_hits=hits.right_negative, | ||
max_len=self._max_amplicon_size, | ||
strand=Strand.POSITIVE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to think about strand here. I think it's probably mostly irrelevant and could just always be set to positive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the PrimerPair
code, it's not explicitly enforced by a __post_init__
but is strongly assumed by e.g. calculate_amplicon_span that the left primer is left of the right primer. This implies that the left primer is on the positive strand and the right primer is on the negative strand. All amplicons are coded as on the positive strand (default value for Span
, again, not explicit).
When a hit to a primer pair is in the same orientation as the primer pair definition, e.g. left positive and right negative, then it makes sense for the amplicon to also be on the positive strand.
If the hit is in the opposite orientation to the primer pair definition, the amplicon should be on the negative strand.
I think it makes sense to assign strandedness like this:
- left primer + / right primer - = amplicon +
- left primer + / left primer - = amplicon +
- right primer + / left primer - = amplicon -
- right primer + / right primer - = amplicon -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed elsewhere, will leave questions of strand and L/L R/R hits to a separate PR.
plus, minus = (h2, h1) if h1.negative else (h1, h2) | ||
if minus.start > plus.end and (minus.end - plus.start + 1) <= max_len: | ||
amplicons.append(Span(refname=plus.refname, start=plus.start, end=minus.end)) | ||
refnames: set[str] = {h.refname for h in positive_hits + negative_hits} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refnames: set[str] = {h.refname for h in positive_hits + negative_hits} | |
refnames: set[str] = {h.refname for h in itertools.chain(positive_hits, | |
negative_hits)} |
amplicons.append(Span(refname=plus.refname, start=plus.start, end=minus.end)) | ||
refnames: set[str] = {h.refname for h in positive_hits + negative_hits} | ||
if len(refnames) > 1: | ||
raise ValueError("Hits are present on more than one reference.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include the refnames here?
negative_hit.start > positive_hit.end | ||
and (negative_hit.end - positive_hit.start + 1) <= max_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we allow for situations where the two primers' mappings overlap on the genome? E.g. maybe the check should be:
negative_hit.start > positive_hit.start and negative_hit.end > positive_hit.end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable. Do you think we'd ever see a situation where a primer is a palindrome and hits the same target both forward and reverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change causes a number of tests to fail. Since it's a breaking change, I'd like to address it in a separate PR.
FAILED prymer/offtarget/offtarget_detector.py::prymer.offtarget.offtarget_detector
FAILED tests/offtarget/test_offtarget.py::test_ok[True] - AssertionError: assert 1187 == 856
FAILED tests/offtarget/test_offtarget.py::test_ok[False] - AssertionError: assert 1187 == 856
FAILED tests/offtarget/test_offtarget.py::test_check_too_many_primer_pair_hits[at the maximum-204-856-True-True] - AssertionError: at the maximum
FAILED tests/offtarget/test_offtarget.py::test_check_too_many_primer_pair_hits[at the maximum-204-856-True-False] - AssertionError: at the maximum
FAILED tests/offtarget/test_offtarget.py::test_to_amplicons[No mappings - overlapping primers (1bp overlap)-positive0-negative0-+-expected0-True] - AssertionError: No mappings - overlapping primers (1bp overlap)
FAILED tests/offtarget/test_offtarget.py::test_to_amplicons[No mappings - overlapping primers (1bp overlap)-positive0-negative0-+-expected0-False] - AssertionError: No mappings - overlapping primers (1bp overlap)
FAILED tests/offtarget/test_offtarget.py::test_to_amplicons[No mappings - amplicon size too big (1bp too big)-positive1-negative1-+-expected1-True] - AssertionError: No mappings - amplicon size too big (1bp too big)
FAILED tests/offtarget/test_offtarget.py::test_to_amplicons[No mappings - amplicon size too big (1bp too big)-positive1-negative1-+-expected1-False] - AssertionError: No mappings - amplicon size too big (1bp too big)
raise ValueError("Hits are present on more than one reference.") | ||
|
||
amplicons: list[Span] = [] | ||
for positive_hit, negative_hit in itertools.product(positive_hits, negative_hits): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good enough if it meets your performance needs @ameynert. As noted elsewhere, if you had many many hits to one reference, you could speed this up by sorting the positive and negative strand hits, looping over list indices, and avoiding whole swathes of pairs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this separately and the speed improvement by sorting the pos/neg strand hits is significant when there are a large number of hits returned, e.g. in a scenario where the max mismatches parameters are relaxed, one of the primer sequences is getting 100+ hits, and there are multiple highly similar reference contigs in the indexed genome that's being searched.
… reference in _to_amplicons
…g left/right hits by strand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
prymer/offtarget/offtarget_detector.py (1)
356-378
: Consider using dict comprehensions for hit organization.Current implementation uses multiple loops to organize hits.
- left_positive_hits: defaultdict[ReferenceName, list[BwaHit]] = defaultdict(list) - left_negative_hits: defaultdict[ReferenceName, list[BwaHit]] = defaultdict(list) - right_positive_hits: defaultdict[ReferenceName, list[BwaHit]] = defaultdict(list) - right_negative_hits: defaultdict[ReferenceName, list[BwaHit]] = defaultdict(list) - - # Split the hits for left and right by reference name and strand - for hit in left_bwa_result.hits: - if hit.negative: - left_negative_hits[hit.refname].append(hit) - else: - left_positive_hits[hit.refname].append(hit) - - for hit in right_bwa_result.hits: - if hit.negative: - right_negative_hits[hit.refname].append(hit) - else: - right_positive_hits[hit.refname].append(hit) + left_hits = {True: defaultdict(list), False: defaultdict(list)} + right_hits = {True: defaultdict(list), False: defaultdict(list)} + + for hit in left_bwa_result.hits: + left_hits[hit.negative][hit.refname].append(hit) + + for hit in right_bwa_result.hits: + right_hits[hit.negative][hit.refname].append(hit) + + left_positive_hits = left_hits[False] + left_negative_hits = left_hits[True] + right_positive_hits = right_hits[False] + right_negative_hits = right_hits[True]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
prymer/offtarget/offtarget_detector.py
(5 hunks)
🔇 Additional comments (5)
prymer/offtarget/offtarget_detector.py (5)
104-106
: LGTM: Type alias improves code clarity.
The ReferenceName
type alias enhances code readability and maintainability.
343-355
: LGTM: Early return optimization.
The early return after exceeding primer hit threshold is correctly implemented.
379-396
: LGTM: Amplicon building logic is correct.
The code correctly handles both positive and negative strand amplicons.
399-414
: LGTM: Result construction is well-structured.
The code properly handles spans based on configuration flags.
470-515
: Consider optimizing amplicon generation for large hit sets.
As noted in past reviews, sorting hits could improve performance when dealing with many hits on a single reference.
amplicons: list[Span] = []
+ # Sort hits by position for efficient iteration
+ positive_hits.sort(key=lambda h: h.start)
+ negative_hits.sort(key=lambda h: h.start)
+
for positive_hit, negative_hit in itertools.product(positive_hits, negative_hits):
if (
negative_hit.start > positive_hit.end
and negative_hit.end - positive_hit.start + 1 <= max_len
):
+ # Can break inner loop if negative_hit.start is too far
+ if negative_hit.start - positive_hit.start + 1 > max_len:
+ break
amplicons.append(
Span(
refname=positive_hit.refname,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
prymer/offtarget/offtarget_detector.py (1)
514-549
: Consider further optimizations.
- Current implementation has O(n*m) complexity where n and m are the number of positive and negative hits.
- TODO about overlapping hits should be addressed.
Consider:
- Using binary search to find valid negative hits
- Implementing parallel processing for large hit sets
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
prymer/offtarget/offtarget_detector.py
(5 hunks)tests/offtarget/test_offtarget.py
(2 hunks)
🔇 Additional comments (6)
prymer/offtarget/offtarget_detector.py (5)
Line range hint 78-106
: LGTM: Clean import additions and type alias.
The new imports and type alias enhance code organization and type safety.
343-413
: LGTM: Improved hit organization and early return optimization.
Good improvements:
- Early return when hit count exceeds maximum
- Structured organization of hits by reference and strand
470-485
: LGTM: Clear method signature update.
Parameters are well-named and documented.
486-501
: LGTM: Thorough input validation.
Good error handling for strand orientation and reference consistency.
502-506
: LGTM: Smart early exit optimization.
Prevents unnecessary sorting when no amplicons can be generated.
tests/offtarget/test_offtarget.py (1)
175-326
: LGTM: Comprehensive test coverage.
Tests thoroughly validate:
- Off-target result building
- Strand handling
- Error conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
prymer/offtarget/offtarget_detector.py (1)
525-525
: Consider implementing overlapping hits support.The TODO suggests allowing overlapping positive and negative hits. This could improve sensitivity for certain primer designs.
Would you like me to propose an implementation that handles overlapping hits?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
prymer/offtarget/offtarget_detector.py
(5 hunks)
🔇 Additional comments (5)
prymer/offtarget/offtarget_detector.py (5)
104-106
: LGTM: Type alias improves code clarity.
The ReferenceName
type alias enhances code readability by providing semantic meaning to reference name strings.
343-354
: LGTM: Early return optimization implemented.
Early return when hit count exceeds maximum prevents unnecessary computation.
357-377
: LGTM: Efficient hit categorization.
Using defaultdict for hit categorization by reference and strand improves code organization and reduces boilerplate.
379-413
: LGTM: Comprehensive amplicon building.
Correctly handles both strand orientations while maintaining proper strand information in the output.
470-514
: LGTM: Efficient implementation with proper validation.
The method correctly validates hit orientations and uses sorting for efficient processing.
if ( | ||
left_bwa_result.hit_count > self._max_primer_hits | ||
or right_bwa_result.hit_count > self._max_primer_hits | ||
): | ||
result = OffTargetResult(primer_pair=primer_pair, passes=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Perhaps I'll "fix" this in a separate PR.
Closes #94
OffTargetDetector._to_amplicons
signature changes fromto
strand
is used to set thestrand
attribute on each of the returnedSpan
s for the amplicon hits.The existing unit test for
OffTargetDetector.to_amplicons
is split into two, one where no error should be raised, and one where aValueError
should be raised, with expected error messages.OffTargetDetector._build_off_target_result
does not change signature.pass = False
to the returnedOffTargetResult
if there are too many hits for either the left or the right primer._to_amplicons
, along with a known amplicon strand - positive when the positive hits are from the left primer of a pair and the negative hits are from the right primer, and negative in the inverse.OffTargetDetector._keep_primer_spans
isTrue
.OffTargetDetector._build_off_target_result
is added.