Error processing large PAF files from minimap2 #67

Open
brettChapman opened this issue Oct 5, 2020 · 23 comments
@brettChapman

Hi Erik

I've been running Seqwish with the output from minimap2 and edyeet for comparison, across multiple chromosomes and also testing with multiple whole genomes.

I've found the PAF files from minimap2 to be around twice the size of the PAF files from edyeet. I've tried processing a 19GB PAF file from minimap2 (from an alignment of chromosome 5H from 4 genomes), but it ends with a bad_alloc error due to inadequate memory. I'm going to try running on a larger machine to see if I can get it to complete. The PAF file from edyeet, on the other hand, can be processed with no problems.

I have also found that running edyeet with s50K across 4 whole genomes results in an error (likely a memory issue, as the job was outright killed), but when running with s100K it seems to continue the alignment with no problems, thus far (it's still running after 17 days). Regardless, I fear the resulting PAF file will still be too large to convert into a GFA file using the nodes from my cluster.

In regards to processing large PAF files, would it be conceivable to perform the alignment in parts (as we have discussed before)? Instead of merging each PAF file and running the concatenated PAF through Seqwish, we could run Seqwish on each smaller PAF file to produce several GFA files, run each through, say, Smoothxg, merge them all with a tool like GFAKluge (https://github.com/edawson/gfakluge), and then run Smoothxg once more on the final GFA. Would breaking up the process into parts right up to producing the final GFA be a good way of reducing memory overhead? I know Smoothxg should probably be run on the whole GFA, but would iteratively running Smoothxg on several GFAs, and then on the final merged graph, ultimately reduce the resource overhead for each individual job? Or would this approach add more issues than it resolves? I see that you're one of the developers of GFAKluge, but it hasn't been updated in a while. Is GFAKluge something worth adding to my pangenome workflow, to help reduce memory overhead? Thanks.
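To make the idea concrete, the per-part pipeline I have in mind would look roughly like this (a sketch only; the file names are placeholders, the parameters are copied from the runs discussed here, and I'm assuming GFAKluge's `gfak merge` subcommand can combine Seqwish-style GFA1 files):

```shell
# Sketch of the proposed divide-and-conquer workflow (placeholder file names).
for part in part1 part2; do
    edyeet -X -s 100000 -p 90 -n 2 -a 90 -k 16 -t 16 "$part.fasta" "$part.fasta" > "$part.paf"
    seqwish -s "$part.fasta" -p "$part.paf" -g "$part.gfa"
    smoothxg -t 16 -g "$part.gfa" > "$part.smooth.gfa"
done
# Merge the per-part graphs (assumes gfak's merge subcommand), then smooth once more.
gfak merge part1.smooth.gfa part2.smooth.gfa > merged.gfa
smoothxg -t 16 -g merged.gfa > final.gfa
```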

@ekg
Owner

ekg commented Oct 5, 2020 via email

@brettChapman
Author

Hi Erik

I'm using -n 16, which is all the cores on one of my nodes. Do you suggest I change to -n 2 or -n 4? I'm currently testing with 4 genomes. I plan to go all the way up to 20 genomes, but I doubt it will scale well on a single node, hence I'm looking at different ways to break the problem down and do as much processing across multiple nodes as I can.

On each node I have 64GB RAM.

If you think it would be a good idea, I'll try working with GFAKluge and see how the final output looks. I'll try and rerun edyeet/minimap2 and seqwish with only a few cores and see if I can get them to complete that way first.

A graph to graph alignment sounds very interesting!

@ekg
Owner

ekg commented Oct 5, 2020 via email

@brettChapman
Author

Hi Erik

Sorry, I said -n 16. I meant to say -t 16.

For the job which is taking over 2 weeks, I ran with these parameters:

edyeet -X -s 100000 -p 75 -n 20 -a 75 -k 11 -t 16 all_genomes.fasta all_genomes.fasta > pangenome.paf

Would you suggest I use -t 2 and -n 3 instead? Perhaps the -n value could be a function of the number of genomes (number of genomes minus 1), and -t the same but minus 2, taking the choice of these values away from the user and applying an optimal number automatically?
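To spell out what I mean by deriving the values automatically, a trivial sketch (my own heuristic, not an existing edyeet option; `GENOMES` is a placeholder):

```shell
# Hypothetical heuristic: derive -n and -t from the genome count
# rather than asking the user (my assumption, not edyeet behaviour).
GENOMES=4
N=$((GENOMES - 1))   # map each sequence against the other genomes
T=$((GENOMES - 2))   # leave a little headroom on the node
if [ "$T" -lt 1 ]; then T=1; fi
echo "edyeet ... -n $N -t $T ..."
```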

In regards to GFAKluge, I saw that one of the functions listed on its GitHub page is merging GFAs; or is its GFA version or format different from the GFAs Seqwish generates?

@ekg
Owner

ekg commented Oct 5, 2020 via email

@brettChapman
Author

Thanks for the explanation. I'll set -n to the total genome count, keep -k at 16, and see how it runs again. I'll leave the currently running edyeet job going just to see how long it takes.

I'll also try a larger -s value later to see how the run time changes.

In regards to the -s value when aligning smaller regions, what would be an appropriate value for multiple gene sequence regions? I tried running edyeet on an extracted gene region +/- 1000bp either side, and found that an -s larger than the longest sequence length kills the job. I then set -s to the length of the shortest sequence among all the gene sequences, and it ran fine, though there were a few gaps in the alignment compared to a minimap2 run. Perhaps I could set -s to half the length of the shortest sequence? An s100K value for whole genomes/chromosomes is only a fraction of the roughly 600 Mbp chromosome length, so a much shorter -s value (perhaps in the range of 500 to 1000) may be more appropriate for aligning gene regions (for my example gene I ran with -s around 5000).
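As a concrete version of the "half the shortest sequence" idea, a small helper I could use (my own awk, not part of edyeet; the toy FASTA stands in for real gene regions):

```shell
# Hypothetical helper: pick -s as half the length of the shortest FASTA record.
printf '>geneA\nACGTACGTAC\n>geneB\nACGTAC\n' > genes.fa   # toy input
S=$(awk '/^>/ { if (len && (min == 0 || len < min)) min = len; len = 0; next }
         { len += length($0) }
         END { if (len && (min == 0 || len < min)) min = len; print int(min / 2) }' genes.fa)
echo "$S"   # shortest record is 6 bp, so S is 3
```

The real run would then pass `-s "$S"` to edyeet against the gene-region FASTA.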

The currently running edyeet job still hasn't started outputting any final output.

A log with expected time to completion would definitely help a lot with planning and testing out parameters quickly.

@ekg
Owner

ekg commented Oct 5, 2020 via email

@brettChapman
Author

Thanks. I'll try edyeet with -M and also give wfmash a go and see how the results compare.

@ekg
Owner

ekg commented Oct 6, 2020 via email

@brettChapman
Author

Thanks! I'll try out the new builds.

In regards to running edyeet with a lower -n value and sticking with -k 16, I just ran edyeet overnight and it failed with 4 genomes:

I ran this:

edyeet -X -s 100000 -p 75 -n 3 -a 75 -k 16 -t 16 all_genomes.fasta all_genomes.fasta > pangenome.paf

The job received this error:

srun: error: node-5: task 0: Killed
srun: Force Terminated job step 2119.1

The job I started up over 2 weeks ago using a -n 20 and -k 11, is still running. I imagine it will eventually get killed off as well.

I'm guessing it's a memory problem. Could I try any other parameters, perhaps a much larger -s value? Or do you think at this stage, with more genomes, the problem is simply inadequate memory, and the only way forward is to split the alignments into 2 genomes per job and merge the PAF files? I've concluded as much with my minimap2 runs. I'm currently testing multiple alignment jobs and concatenation of PAF files with minimap2 now. Each genome is roughly 4Gbp.
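The pairwise scheme I'm testing looks roughly like this (a sketch; the genome names are placeholders, and `-x asm20` is just my guess at a suitable minimap2 preset for same-species assemblies):

```shell
# One minimap2 job per genome pair, then concatenate the PAFs for seqwish.
# -c emits base-level CIGARs, which seqwish needs.
for pair in "g1 g2" "g1 g3" "g1 g4" "g2 g3" "g2 g4" "g3 g4"; do
    set -- $pair
    minimap2 -cx asm20 -t 16 "$1.fasta" "$2.fasta" > "$1_vs_$2.paf"
done
cat g*_vs_g*.paf > pangenome.paf
seqwish -s all_genomes.fasta -p pangenome.paf -g pangenome.gfa
```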

@ekg
Owner

ekg commented Oct 7, 2020 via email

@ekg
Owner

ekg commented Oct 8, 2020

I would also start backing off on the parameters. What level of divergence do you expect between the sequences? Set -p and -a a little below that. Using a shorter segment length is probably not a bad idea.

@brettChapman
Author

Ok, thanks I'll try with lower parameter values. Perhaps -a 50 -p 50 and -s 50000 or -s 75000.

The divergence between varieties is quite variable. Some varieties diverged recently, having been crossed by breeders over the last decade or two (or even more recently), while others are wild-type varieties that may have diverged <10K years ago.

@ekg
Owner

ekg commented Oct 8, 2020 via email

@brettChapman
Author

Ok, so perhaps -a 90 -p 90 -s 50000 ? I'll submit another job with these values.

@ekg
Owner

ekg commented Oct 8, 2020 via email

@brettChapman
Author

I'm running jobs now on chromosome 2, across 4 genomes. p99/a99/s50k, p95/a95/s50k, p90/a90/s50k, p90/a90/s100k.

I have a feeling 90% will be the sweet spot. I've just been looking closely at some of the gene regions I've aligned, and some are weird because I pulled in too many genes from other regions in the MMseqs2 step, which had a minimum identity of 70% and coverage of 80%. Looking at the results, if I filter at 90% identity, the number of mismatches drops from around 65 down into the 10s, and the gene counts are now 1 per chromosome, although I do sometimes get duplicate genes on the same chromosome, even with high percent identity thresholds. I'm now running my pangene pipeline again with 90% identity and 80% coverage. I'll check the gene alignments again and see if they all look better. Looking at the gene regions of several genes is probably a good indicator of what thresholds to apply at the genome level.

@brettChapman
Author

Hi Erik

I now have some results from the chromosome 2H alignment across 4 genomes. I ran smoothxg with spoa (because abPOA is buggy). I then ran ODGI stats, view etc.

Here is a summary on the stats (sum in node space and sum in nucleotide space, obtained from ODGI stats):

2H-edyeet_comparison

p99/a99/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p99_a99_n2_k16 viz

p95/a95/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p95_a95_n2_k16 viz

p90/a90/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p90_a90_n2_k16 viz

p90/a90/s100k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s100000_p90_a90_n2_k16 viz

Here is a tree of the 20 genomes, produced from mashtree. You can see that Morex is closer to Igri than the two others, which is also apparent from the ODGI stats based on the small differences in node/nucleotide space between them.

pangenome_tree_bootstrap_1000rep

Based on these results, I'm leaning towards p90/a90/s50K, as I'd expect the nucleotide/node space to be more similar between Morex and Igri. The alignments from p90/a90/s50K also look less fragmented when compared to the other alignments.

What are your thoughts? Do you think that is convincing evidence?

Thanks.

@AndreaGuarracino
Collaborator

Hi Brett,

thank you for your tests and your always detailed posts.

You are using odgi stats -l/-s to generate those tables, right? It is interesting how you are using them to look at the genomes' relationships. However, those metrics were designed to evaluate the goodness of a graph's sorting (in 1D; we are also developing them for 2D evaluation), so they may not be the best way to judge what you are after. I am thinking of the case where the genomes are related, but, for stochastic reasons, their sortings have gone slightly different ways.

@brettChapman
Author

Hi Andrea

Thanks.

Yeah, I used -l and -s with odgi stats:

odgi stats -t 16 -i output.og -S -C -l -s -P > output.stats

How would you suggest I determine the most appropriate edyeet alignment parameters to use? All my varieties are of the same species: some diverged recently, over the last 10, 20, or 50 years (developed by breeders), while the wild-type varieties diverged <10K years ago, since the start of agriculture.

@AndreaGuarracino
Collaborator

Hard to say, but I would have followed the same strategy, that is, to evaluate which result best reflects the evolutionary relationships that are already known a priori, or that are expected to be observed. That would require a metric to assess path similarity, such as shared nodes, or a more refined phylogenetic analysis. Keeping an eye on those metrics (odgi stats -l and -s) is still a good thing, though, and you can also improve them by sorting the obtained graphs downstream with odgi sort.
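As a rough illustration of the shared-nodes idea, one could compute a Jaccard similarity over the node sets of two paths straight from the GFA P-lines (my own throwaway awk, not an odgi command; the toy two-path graph stands in for yours):

```shell
# Jaccard similarity of the node sets of two paths, read from GFA1 P-lines.
# Toy graph: sampleA visits nodes 1,2,3; sampleB visits 1,3.
printf 'P\tsampleA\t1+,2+,3+\t*\nP\tsampleB\t1+,3+\t*\n' > toy.gfa
J=$(awk -F'\t' '$1 == "P" {
        n = split($3, steps, ",")
        for (i = 1; i <= n; i++) {
            node = substr(steps[i], 1, length(steps[i]) - 1)   # strip +/- strand
            seen[$2, node] = 1
            nodes[node] = 1
        }
    }
    END {
        inter = 0; uni = 0
        for (node in nodes) {
            a = (("sampleA", node) in seen)
            b = (("sampleB", node) in seen)
            if (a && b) inter++
            if (a || b) uni++
        }
        printf "%.2f", inter / uni
    }' toy.gfa)
echo "$J"   # 2 shared nodes out of 3 total: 0.67
```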

@brettChapman
Author

Thanks for the advice. I've now run it through odgi sort:

p99/a99/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p99_a99_n2_k16 sorted viz

p95/a95/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p95_a95_n2_k16 sorted viz

p90/a90/s50k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s50000_p90_a90_n2_k16 sorted viz

p90/a90/s100k alignment:
Morex_Igiri_Akashinriki_Barke_2H_edyeet_s100000_p90_a90_n2_k16 sorted viz

I also redid the odgi stats, and the node space and nucleotide space no longer correlate with the evolutionary relationships between varieties. How can I make good use of odgi stats? Or is it mainly used for testing optimal parameters for graph building?

I saw that odgi view can convert back to GFA. Is that worth doing after sorting? Would it improve the graph analysis and visualisation downstream with VG?

In regards to the best edyeet parameters, I think I'll just have to go on how well the alignment looks. I've tested with smaller gene regions as well, and for genes without any huge indels, -a 90 and -p 90 work well. For small gene regions of several hundred to a few thousand bp, I'm trialling -s 500. For whole chromosomes I'm thinking -s 50k, as the alignments using -s 100k tend to be more fragmented. I'll try -s 10k and see how it compares; however, barley is highly repetitive, so going lower than 50k may not be such a good idea.
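For the sort-then-convert step, what I have in mind is something like this (a sketch; I'm assuming current odgi and vg flags, and the file names are placeholders):

```shell
# Sort the graph, convert back to GFA, then import into vg for downstream work.
odgi sort -i pangenome.og -o pangenome.sorted.og
odgi view -i pangenome.sorted.og -g > pangenome.sorted.gfa
vg view -Fv pangenome.sorted.gfa > pangenome.vg   # -F: GFA in, -v: vg out
```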

@brettChapman
Author

brettChapman commented Oct 23, 2020

I've settled on these alignments:

s100k, p90, a90:
Morex_Igiri_Akashinriki_Barke_2H_edyeetParams-s100000-p90-n2-a90-k16_seqwishParams-k8_smoothxgParams-w10000-j5000-k0-e100-l10000 odgi_viz_alignment

s100k, p90, a85:
Morex_Igiri_Akashinriki_Barke_2H_edyeetParams-s100000-p90-n2-a85-k16_seqwishParams-k8_smoothxgParams-w10000-j5000-k0-e100-l10000 odgi_viz_alignment

a100k, p85, p90:
Morex_Igiri_Akashinriki_Barke_2H_edyeetParams-s100000-p85-n2-a90-k16_seqwishParams-k8_smoothxgParams-w10000-j5000-k0-e100-l10000 odgi_viz_alignment

I chose a segment length of 100Kbp because these are large chromosomes and the repeat content is quite high, around 80%. I chose p85/90 and a85/90 as this should be around the percent divergence, and the alignments don't appear too fragmented. I'm undecided on whether -a should be near, lower than, or the same as -p. On Erik's PGGB README page he suggests -a should be near or lower than -p, to ensure a base-level alignment for the mapping.

What is typically used? I'd like to get a good base-level mapping and alignment. Thanks.
