[BUG]: phg build-kmer-index #266
Comments
Hello, this error is thrown because we keep track of a DiscardSet of kmers that are too repetitive. Because you are working in wheat, I suspect you have a very large number of kmers and are running close to the size limits of many traditional data structures. If a kmer is seen more than .5 * numSamples times, it is added to this set. Right now this threshold is hardcoded, but I can probably expose it as a parameter so you can relax it to allow for more repetitiveness. It also looks like I might be able to use a different data structure that can handle more kmers, at the expense of RAM/speed. I will integrate these two changes shortly and will respond here when that version of the code is ready. Just a heads up: we are also working on integrating a different alignment process using the ropebwt3 aligner. This should be ready to use in the near future.
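For readers following along, the discard rule described above can be sketched roughly as follows. This is only an illustration, not the PHG implementation: the function name, the `max_proportion` parameter, and the plain dictionary of counts are all hypothetical, and the thread does not say whether "seen" counts occurrences or distinct samples.

```python
def split_by_repetitiveness(kmer_counts, num_samples, max_proportion=0.5):
    """Partition kmers into (kept, discarded) using a count threshold.

    kmer_counts: dict mapping kmer -> number of times it was seen
                 (hypothetical input; how PHG counts is not stated in the thread).
    A kmer is discarded when its count exceeds max_proportion * num_samples.
    """
    threshold = max_proportion * num_samples
    kept, discarded = set(), set()
    for kmer, count in kmer_counts.items():
        if count > threshold:
            discarded.add(kmer)   # too repetitive to be informative
        else:
            kept.add(kmer)
    return kept, discarded


if __name__ == "__main__":
    counts = {"ACGT": 1, "TTTT": 6, "GGCA": 5}
    kept, discarded = split_by_repetitiveness(counts, num_samples=10)
    print(sorted(kept), sorted(discarded))
```

Raising the proportion in this sketch moves kmers from the discarded set back into the kept set, which is the trade-off between tolerance for repeats and memory mentioned above.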
Would it be a good idea to use repeat-masked DNA sequence? I could use either
I tried the .189 release with and without the --use-big-discard-set option and it still crashes, maybe in a different part of the code. Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
I think I am also running into a similar issue with Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
@ClayBirkett It looks like you are now filling up the kmer -> HapIdSet map. You could set --max-sample-proportion to something lower than the default of .5, but that may heavily change your mapping rates. If you lower this parameter, more kmers will be put in the discard set. The other option is to adjust the kmer mask so that it considers fewer possible kmers. By default it retains only min kmers that end in C, but we can adjust it to require an ending of CC, which would reduce the count by about 4x and may help here. To do this, set --hashMask to 15 and --hashFilter to 5. These values should work correctly and may relieve a lot of the issues you and @jesse-hill are seeing. We are also close to opening up an alternative alignment approach using the ropebwt3 tool. This should work fairly efficiently with a genome of any size (they have indexed terabases). We have things initially coded up and in a Pull Request which I will be merging in shortly. I will also need to write some basic documentation to show you how to run it. We have not done full tests with this and it is in a 'beta' stage, but it might be worth trying out to see if it at least fixes the issues you are seeing.
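To make the mask/filter idea concrete, here is a small sketch of how a test of the form `(hash & mask) == filter` selects kmers by their last bases. It assumes a 2-bit encoding of A=0, C=1, G=2, T=3 with the last base in the lowest bits; under that assumption a mask of 15 with a filter of 5 (binary 0101) corresponds to "ends in CC". The mask of 3 and filter of 1 used for the "ends in C" case are inferred from the same assumption and are not stated in this thread, and the function names are hypothetical.

```python
from itertools import product

# Assumed 2-bit base encoding; PHG's actual encoding is not shown in this thread.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a kmer into an integer, two bits per base, last base in the low bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def passes_mask(kmer, hash_mask, hash_filter):
    """Keep the kmer only if its masked low bits equal the filter value."""
    return (encode_kmer(kmer) & hash_mask) == hash_filter


if __name__ == "__main__":
    all_4mers = ["".join(p) for p in product("ACGT", repeat=4)]
    ends_in_c = [k for k in all_4mers if passes_mask(k, 3, 1)]
    ends_in_cc = [k for k in all_4mers if passes_mask(k, 15, 5)]
    # Each extra fixed base keeps roughly a quarter of the previous set.
    print(len(all_4mers), len(ends_in_c), len(ends_in_cc))
```

On uniformly enumerated kmers, tightening the mask from one fixed base to two keeps a quarter as many kmers, which matches the "about 4x" reduction described above; real genomes will deviate from that depending on base composition.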
What is the significance of a kmer ending in C or CC?
It is mainly to reduce the total number of kmers. With a 32-mer we tend to see too many kmers to fit in standard primitive data structures. By only using ones that end in C (effectively making it a 31-mer with a C added on), it reduces the number by 4x, which helps with RAM without purging non-informative kmers.
Description
The program crashed with the following error message. The scaffold it was working on is part of an assembly with normal chromosomes that have a few scaffolds added onto the end of the fasta file.
[main] INFO net.maizegenetics.phgv2.utils.SeqUtils 2024-12-30 20:46:11,076: queryAgc: finished chrom scaffold_v5_415-1
Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
at it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
at it.unimi.dsi.fastutil.longs.LongOpenHashSet.add(LongOpenHashSet.java:406)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:341)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:184)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:99)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:38)
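The numbers in the "Too large" message can be checked with a little arithmetic. The sketch below mirrors, as I understand it, how fastutil sizes an open hash table: the expected element count is divided by the load factor, rounded up to a power of two, and rejected if the result exceeds 2^30 slots. The 2^30 cap and the rounding rule are my reading of `HashCommon.arraySize` and should be treated as an assumption; the function names here are hypothetical.

```python
import math

MAX_TABLE_SIZE = 1 << 30  # assumed cap on the backing array of a fastutil open hash set


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def required_table_size(expected, load_factor=0.75):
    """Table size needed to hold `expected` elements at the given load factor."""
    return max(2, next_power_of_two(math.ceil(expected / load_factor)))


def fits(expected, load_factor=0.75):
    """True if a table for `expected` elements stays within the assumed cap."""
    return required_table_size(expected, load_factor) <= MAX_TABLE_SIZE


if __name__ == "__main__":
    print(fits(805_306_368))  # 0.75 * 2**30 elements
    print(fits(805_306_401))  # the count reported in the exception
```

Under these assumptions a single set tops out at 0.75 * 2^30 = 805,306,368 elements, and the 805,306,401 in the exception is just past that limit, so the crash reflects the size of the set itself rather than the JVM heap setting.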
Expected behavior
No response
PHG version
phg version 2.4.33.188