Discrepancies between Edyeet and minimap2 PAF output #1
Comments
edyeet and minimap2 are fundamentally different algorithms. Both produce semantically equivalent PAF, but they get there by different routes.

minimap2 starts with minimizer chaining. Chains are built and filtered using heuristic criteria. Then, when

edyeet starts with the mashmap2 approximate mapping algorithm, which scans segments of each query across the reference set, scoring each possible mapping location based on mash distance. A top number of mappings are kept (the number of secondaries given by

The mapping and alignment phases in edyeet are relatively tightly bounded. Given a fixed segment length (you're using 10kb, but experiments suggest that going even higher might be good for pangenomes), the mapping time for each segment across the reference set is approximately the same. Then, we get no more than

These tight bounds do not appear to hold for minimap2. The chaining process does not provide the same kinds of bounds, and it seems that repetitive sequences can use dramatically more time and memory to complete this step. The segment boundaries are not fixed, but determined by minimizer chaining heuristics. As a result, we may end up aligning many small segments from some sequences and single long ones from others. This in turn can affect the runtime and (possibly) the memory requirements of the alignment.

I suspect that minimap2 is more likely to give us good alignments, but it can cost much more. MashMap2, and by extension edyeet, provides a very fast method for seeing the high-level, large-scale relationships in a set of sequences. I think that this is a good place to start when building a pangenome graph, in that we can later go in and compress or homogenize the alignment using methods for multiple alignment that have affine gaps and such (smoothxg is the first prototype of this). |
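For readers unfamiliar with the terms above, here is a minimal sketch of the two ideas the explanation rests on: cutting a query into fixed-length segments, and converting a k-mer Jaccard similarity into a Mash distance. It uses exact k-mer sets for clarity, whereas MashMap2 estimates the similarity from sketches. The function names and the toy values are illustrative assumptions, not edyeet's internals.

```python
import math


def split_segments(seq, segment_length):
    """Cut a sequence into consecutive fixed-length segments (last one may be shorter)."""
    return [seq[i:i + segment_length] for i in range(0, len(seq), segment_length)]


def kmers(seq, k):
    """Set of all k-mers in a sequence (exact, no sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mash_distance(j, k):
    """Mash distance from Jaccard similarity j and k-mer size k, clamped to [0, 1]."""
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


# Toy example: score each segment of a query against one candidate target window.
query = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
target = "ACGTTGCAAGCTTAGGCTTACGTTAGC"  # one substitution relative to the query
k = 4
for segment in split_segments(query, 10):
    j = jaccard(kmers(segment, k), kmers(target, k))
    print(segment, round(mash_distance(j, k), 4))
```

Because every segment has the same length, the number of units to map is simply the sequence length divided by the segment length, rounded up, which is one way to read the "tightly bounded" behaviour described above.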
Just a second note to keep things organized. I'm still learning how to use edyeet for this application, and I'm finding that a combination of large segment sizes and permissive identity bounds seems to give good results on a collection of yeast genomes. Taking these genomes in yeast.pan.fa.gz, I found that this worked very well:
So if you continue with edyeet, I'd suggest trying larger segment sizes and lower identity thresholds. |
Thanks for the very detailed explanation. I'll try larger segment sizes and lower identity thresholds, as you suggested. My barley genomes are very repetitive, which would probably explain why minimap2 takes longer and why its output files are much larger by comparison. I agree that it may be more prudent to start at a higher level with edyeet + smoothxg and then work down. One attractive aspect of a pangenome graph with VG is being able to produce variant calls directly from the graph using vg deconstruct. Between a graph built with minimap2 and one built with edyeet + smoothxg, would the resulting variants in the VCF file be significantly different? Would minimap2 identify more variants by comparison, or could edyeet, assisted by smoothxg, make the two graphs and their resulting variants (vg deconstruct) reasonably comparable? Thanks. |
That's the hope. Please report if you have success with vg deconstruct. I find it difficult to get working on arbitrary graphs. Although the theory behind it is clean, getting things into VCF space seems to be hard.
It's really not clear to me. It depends on the way that things are parameterized, and there are unfortunately a lot of parameters. I think you just have to try and see how it goes. At this point, it's all blue sky research. Hopefully in time we'll have standard approaches that work in general for most cases. |
Thanks. I'll try first with the parameters you suggest, using larger segment lengths, and then adjust as needed. I'll also still try minimap2 (although I've found it takes around 7x longer to run). I'm also going to time the alignments between 2 chromosomes and then 4 chromosomes using edyeet and minimap2, to compare alignment times and to see how they would likely scale to all 7 chromosomes and 20 genomes. I expect the alignment times for the whole pangenome to turn out huge (possibly up to several months), leaving me no choice but to split the alignments across nodes, even though I may miss secondary alignments. It'll be good to know the alignment times and the pros and cons moving forward, and to be able to justify that choice. I've just upgraded my cluster with 10 more nodes (now 20 nodes) and 100TB of space, so running across many nodes shouldn't be a problem. |
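One rough way to turn two timing runs into a whole-pangenome estimate is to fit a power law through the two measurements and extrapolate. The sketch below assumes runtime follows `t = a * n**b` in the number of input sequences `n`, which may well not hold for repetitive genomes. The timings shown are hypothetical placeholders, not measurements from this thread.

```python
import math


def fit_exponent(n1, t1, n2, t2):
    """Exponent b of t = a * n**b passing through two (size, time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def extrapolate(n1, t1, n2, t2, n_target):
    """Predicted runtime at n_target under the fitted power law."""
    b = fit_exponent(n1, t1, n2, t2)
    return t2 * (n_target / n2) ** b


# Hypothetical timings in hours for 2 and 4 chromosome sequences.
hours_for_2 = 1.0
hours_for_4 = 4.0
total_sequences = 7 * 20  # 7 chromosomes across 20 genomes

print("fitted exponent:", fit_exponent(2, hours_for_2, 4, hours_for_4))
print("estimated hours:", extrapolate(2, hours_for_2, 4, hours_for_4, total_sequences))
```

Two points give no indication of how good the fit is, so a third timing run (say, 8 sequences) would be a cheap sanity check before committing to a cluster plan.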
You might be interested in the pangenome graph builder, pggb. It pulls together the steps for pangenome graph construction in one script. |
Hi Erik, thanks. This looks really useful. In my case I'll need to run edyeet across multiple nodes and merge the PAF files, but I could include some sbatch jobs with dependencies in the bash script for that. It's a good framework on which to build a pangenome pipeline. |
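Since PAF is a headerless, line-oriented format, merging per-node outputs can be as simple as concatenating them. A minimal sketch follows, assuming each node wrote a complete PAF file and that the downstream step does not require any particular record order. The function name and file layout are hypothetical.

```python
def merge_paf(paf_paths, merged_path):
    """Concatenate PAF files into one file and return the number of records written."""
    records = 0
    with open(merged_path, "w") as out:
        for path in paf_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    # Guard against a file whose last line lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    records += 1
    return records
```

A call such as `merge_paf(sorted(glob.glob("chunks/*.paf")), "merged.paf")` (with `import glob`, and a hypothetical `chunks/` directory) could then run as the final job once all the alignment jobs have finished.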
Hi Erik
I've run Edyeet on 2 chromosomes of the same variety but from different assembly versions (related to the issue I posted for smoothxg: pangenome/smoothxg#2). I also ran minimap2 on the same input. However, the resulting PAF files differ hugely in size.
The Edyeet run gives me a 37 MB PAF file and also gives an error when running the resulting GFA with smoothxg:
The minimap2 run gives me an 8.3 GB PAF file (and it's still running, as it has been for the last 4 days, compared to the relatively quick completion time of Edyeet):
Are the discrepancies in size between the two mainly due to the parameters, or is minimap2 much more sensitive? Or perhaps Edyeet finished prematurely, which would explain why its resulting GFA file gives me an error with smoothxg? Do you generally expect smaller PAF files with Edyeet?
Thanks.
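File size alone says little about what the two tools found, so a record-level summary can make the comparison more concrete. The sketch below reads the 12 mandatory PAF columns and reports the record count, the query bases covered, and an aggregate identity (column 10 divided by column 11). The function name and the sample records are made up for illustration, and for approximate mappings these columns may be estimates rather than exact alignment counts.

```python
def summarize_paf(lines):
    """Summarize PAF records: count, query bases covered, and aggregate identity."""
    records = 0
    query_bases = 0
    matches = 0
    block_length = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        records += 1
        query_bases += int(fields[3]) - int(fields[2])  # query end - query start
        matches += int(fields[9])                       # number of matching residues
        block_length += int(fields[10])                 # alignment block length
    identity = matches / block_length if block_length else 0.0
    return {
        "records": records,
        "query_bases": query_bases,
        "identity": identity,
    }


# Made-up records, only to show the expected column layout.
example = [
    "q1\t1000\t0\t500\t+\tt1\t2000\t100\t600\t450\t500\t60",
    "q2\t800\t100\t300\t-\tt1\t2000\t700\t900\t190\t200\t60",
]
print(summarize_paf(example))
```

Running this over both PAF files (passing an open file handle as `lines`) would show whether the larger file holds many more records, longer alignments, or mostly overlapping secondary hits.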