[cudamapper] Accuracy improvements through chaining #565
Open: edawson wants to merge 72 commits into NVIDIA-Genomics-Research:dev from edawson:anchmer-fast-score
… interface for anchmer-based overlapping.
Completes the basic functionality for the generate_anchmers kernel. Simply prints out anchmers currently.
Provides a working (but unfinished) implementation of anchmer-based overlap generation. Filtering and sorting are not yet implemented.
…erlaps. Implements first-round overlap filtering using cub::DeviceSelect::Flagged and a masking kernel.
Implements a basic overlap fusion procedure on GPU, using cub::DeviceRunLengthEncode::NonTrivialRuns and a set of CUDA kernels.
Turns off fusion in host code and removes the initial filtering mask for short overlaps in an attempt to generate longer intermediate and final overlaps.
…equivalence distance from 150bp to 20bp.
Disables initial filter for short overlaps before chaining anchmers. This improves recall compared to minimap2.
using overlapmers. Implements anchmer chaining using overlapmers, which are successively larger overlap windows. Disables CPU fusion again and runs several rounds of GPU-based overlapmer chaining instead.
set of final overlaps that are only 70% intersection / 40% accurate compared to mm2.
This integrates a new implementation of anchmer-based overlap chaining which incorporates a simple scoring mechanism.
Completes an anchmer-based chaining algorithm (with scoring) that achieves 94% intersection / 68% correctness as reported by PAF-assess.
Changes the == operator of Anchors to prevent fusing two anchors if they have the same query position in read. Two adjacent anchors tend to have the same query_position_in_read when a repeat is present.
Fixes a bug where the number of chains was not set correctly when generating anchmers. This caused many anchmers with full-length chains to be dropped. Also disables repeat masking by RLE for the moment to test the effect of the debugged anchmer chaining.
and [Guo et al](https://vast.cs.ucla.edu/sites/default/files/publications/minimap2-acc-approved.pdf). Implements a transformation of minimap2's chaining algorithm similar to what is used in Guo et al. This involves a forward search (up to N overlaps) and a simple cumulative scoring algorithm. This is actually very similar to the windowed chaining algorithm used for anchmers but does not degrade when encountering repeated seeds (as anchmers and RLE do).
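The forward search with cumulative scoring described above can be sketched on the host as follows. This is a minimal illustration, not the code in this branch: `SketchAnchor`, `ChainResult`, `chain_anchors`, and the parameters `seed_length`, `max_gap`, and `max_lookahead` are hypothetical names, and the linear gap penalty is a deliberate simplification of minimap2's more involved gap cost. It assumes the anchors are already sorted by target position, then query position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical anchor type for illustration; not the cudamapper Anchor struct.
struct SketchAnchor
{
    std::int32_t query_pos;
    std::int32_t target_pos;
};

struct ChainResult
{
    std::vector<std::int32_t> score;       // best chain score ending at each anchor
    std::vector<std::int32_t> predecessor; // previous anchor in the chain, -1 at a chain start
};

// Forward search: each anchor i tries to extend into at most max_lookahead
// successors. Anchors are assumed sorted by (target_pos, query_pos).
ChainResult chain_anchors(const std::vector<SketchAnchor>& anchors,
                          std::int32_t seed_length,
                          std::int32_t max_gap,
                          std::int32_t max_lookahead)
{
    assert(max_lookahead > 0);
    const std::int32_t n = static_cast<std::int32_t>(anchors.size());
    ChainResult result;
    result.score.assign(n, seed_length); // every anchor starts as its own chain
    result.predecessor.assign(n, -1);

    for (std::int32_t i = 0; i < n; ++i)
    {
        const std::int32_t last = std::min(n, i + 1 + max_lookahead);
        for (std::int32_t j = i + 1; j < last; ++j)
        {
            const std::int32_t dq = anchors[j].query_pos - anchors[i].query_pos;
            const std::int32_t dt = anchors[j].target_pos - anchors[i].target_pos;
            // Skip successors that are not colinear or that are too far away.
            if (dq <= 0 || dt <= 0 || dq > max_gap || dt > max_gap)
                continue;
            const std::int32_t gain        = std::min({dq, dt, seed_length});
            const std::int32_t gap_penalty = std::abs(dq - dt); // simplified, linear
            const std::int32_t candidate   = result.score[i] + gain - gap_penalty;
            if (candidate > result.score[j])
            {
                result.score[j]       = candidate;
                result.predecessor[j] = i;
            }
        }
    }
    return result;
}
```

Because anchors are visited in sorted order, `score[i]` is final by the time anchor `i` is used as a predecessor. Chains are recovered by following `predecessor` back from the highest-scoring anchors.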
Removes all the cerr debugging output, as PEF logs were exploding to large sizes.
Implements minimap2's scoring algorithm for chaining.
…in scoring function.
…g in overlapper_anchmer.cu. After refactoring the chaining loop to correctly utilize threads, this commit further reduces the amount of work by terminating the predecessor finding process as soon as a match is found (rather than continuing).
…chmer and refactor the initial ID check.
…er_minimap. Switches overlapping to a new overlapper (overlapper_minimap).
…d line options. Implements command line options and functions which filter read overlaps covering the entire read. If `-X` is passed, such overlaps are removed.
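A minimal sketch of what such a filter might look like is given below. The record type `FullReadOverlap`, its field names, and the function names are assumptions made for illustration, as is the interpretation that an overlap "covers the entire read" when it spans the full length of both the query and the target. The `drop_enabled` argument stands in for the `-X` flag.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical overlap record for illustration; field names are assumptions.
struct FullReadOverlap
{
    int query_start;
    int query_end; // exclusive
    int query_length;
    int target_start;
    int target_end; // exclusive
    int target_length;
};

// True when the overlap spans the full length of both reads (assumed definition).
bool covers_entire_read(const FullReadOverlap& o)
{
    return o.query_start == 0 && o.query_end == o.query_length &&
           o.target_start == 0 && o.target_end == o.target_length;
}

// Removes full-read overlaps in place when the filter is enabled.
void drop_full_read_overlaps(std::vector<FullReadOverlap>& overlaps, bool drop_enabled)
{
    if (!drop_enabled)
        return;
    overlaps.erase(std::remove_if(overlaps.begin(), overlaps.end(), covers_entire_read),
                   overlaps.end());
}
```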
Implements chaining using a modified version of minimap2's chaining algorithm.
…es or which are contained. Adds methods and a procedure that filter overlaps that appear to be duplicates. Overlaps with greater than 80% reciprocal overlap are dropped, as are those which are completely contained within another overlap that occurs at a later index. Accuracy for the E. coli dataset has not improved significantly. However, the drosophila dataset

| Overlapper | Dataset | precision | recall | percent_correct | num_correct | num_records_mm2 | num_records_cudamapper |
|---|---|---|---|---|---|---|---|
| triggered | E. coli | 0.3248175182481752 | 0.9616427741185587 | 0.7530217566478646 | 5607 | 7743 | 15126 |
| minimap | E. coli | 0.41585440146207736 | 0.9120495931809376 | 0.7732936845086378 | 5461 | 7743 | 10850 |
| triggered | drosophila test | 0.11068702290076336 | 0.12323943661971831 | 0.8285714285714286 | 29 | 284 | 7 |
| minimap | drosophila test | 0.9819004524886877 | 0.852112676056338 | 0.8966942148760331 | 217 | 284 | 154 |
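The duplicate and containment rules described in this commit message can be illustrated with a small host-side sketch. It is not the code in this branch: `Interval`, `reciprocal_overlap`, `contained_in`, and `drop_duplicates_and_contained` are hypothetical names, and the sketch assumes the comparison is made on a single half-open coordinate interval per overlap, which may differ from how the real filter treats query and target spans.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical half-open interval [start, end) standing in for an overlap's span.
struct Interval
{
    int start;
    int end;
};

// Fraction of the longer-relative intersection: min(inter/len_a, inter/len_b).
double reciprocal_overlap(const Interval& a, const Interval& b)
{
    const int len_a = a.end - a.start;
    const int len_b = b.end - b.start;
    if (len_a <= 0 || len_b <= 0)
        return 0.0;
    const int inter = std::max(0, std::min(a.end, b.end) - std::max(a.start, b.start));
    return std::min(static_cast<double>(inter) / len_a, static_cast<double>(inter) / len_b);
}

// True when a lies completely inside b.
bool contained_in(const Interval& a, const Interval& b)
{
    return b.start <= a.start && a.end <= b.end;
}

// Drops entry i when some later entry j contains it or reciprocally
// overlaps it by more than the threshold (0.8 in the commit message).
std::vector<Interval> drop_duplicates_and_contained(const std::vector<Interval>& in,
                                                    double threshold)
{
    std::vector<Interval> kept;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        bool drop = false;
        for (std::size_t j = i + 1; j < in.size() && !drop; ++j)
        {
            drop = contained_in(in[i], in[j]) || reciprocal_overlap(in[i], in[j]) > threshold;
        }
        if (!drop)
            kept.push_back(in[i]);
    }
    return kept;
}
```

The pairwise comparison is quadratic in the number of overlaps, which is acceptable for a sketch; a production filter would more likely sort the overlaps and compare neighbours only.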
predecessor search iterations during chaining.
- Reduces the number of search iterations during chaining to 32 (from 64). Minimap2 uses 50 and Guo et al. use 64; however, we may be able to get away with fewer since we have sorted anchors. We'll need to benchmark to find out.
- Multiple deletions to clean up code.
…ins. Fixes local indexing issues when masking anchors that are not chain terminators. However, this is still not the most efficient way of doing traceback.
Qtpairs chainer
…ixes squash changes
…trace in chainerutils.
[cudamapper] Remove OverlapperMinimap test file, refactor to use back…
…f add support for writing intermediate results from the ith tile to the i+1th tile
…ugh backtrace 2. add debug code for backtrace by serializing backtrace and cpu or gpu
…-recall-improvements 1. precision/recall improvements 2. removed extra syncthreads 3. othe…
…ntation that processes tiles from a given read sequentially.
…he end of a chain so that it may be used in the next tile when running chain_anchors_in_tile.
…_size+1 when processing next read tile. Turn off postprocessing of overlaps.
…t properly placing CUB work in a cuda_stream.
… an entire read. Move functions to grid-stride loops with a macro-defined number of blocks/threads where possible.
…verlaps and threshold to >= 0.9
Tiler chainer
…crease block_count in overlapper_minimap. The reduction in block size seems to increase recall for t.fa, which is indicative of unstable behavior.
In theory, this branch will help push our accuracy closer to that of minimap2 through changes in the chaining and scoring algorithms. It also includes some filtering improvements (which may be broken out into a separate PR).