
[BUG]: phg build-kmer-index #266

Open
ClayBirkett opened this issue Dec 31, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@ClayBirkett

Description

The program crashed with the following error message. The scaffold it was working on is part of an assembly consisting of normal chromosomes with a few scaffolds appended to the end of the FASTA file.

[main] INFO net.maizegenetics.phgv2.utils.SeqUtils 2024-12-30 20:46:11,076: queryAgc: finished chrom scaffold_v5_415-1
Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
at it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
at it.unimi.dsi.fastutil.longs.LongOpenHashSet.add(LongOpenHashSet.java:406)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:341)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:184)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:99)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:38)

Expected behavior

No response

PHG version

phg version 2.4.33.188

ClayBirkett added the bug label Dec 31, 2024
@zrm22
Collaborator

zrm22 commented Dec 31, 2024

Hello,

This error is being thrown because we keep track of a discard set of kmers that are too repetitive. I suspect that because you are working in wheat you have a very large number of distinct kmers and are running close to the limits of the traditional data structures we use.

If a kmer is seen in more than 0.5 * numSamples samples, it is added to this set. Right now that threshold is hardcoded, but I can probably expose the parameter so you can relax it to allow for more repetitiveness.
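For reference, a minimal sketch of the thresholding being described, using illustrative names (buildDiscardSet, kmerSampleCounts, maxSampleProportion are not the actual BuildKmerIndex identifiers):

import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap
import it.unimi.dsi.fastutil.longs.LongOpenHashSet

// Illustrative only: kmer hashes seen in more than maxSampleProportion * numSamples
// samples are moved into a discard set instead of being kept in the index.
fun buildDiscardSet(
    kmerSampleCounts: Long2IntOpenHashMap, // kmer hash -> number of samples containing it
    numSamples: Int,
    maxSampleProportion: Double = 0.5      // currently hardcoded; could become a CLI option
): LongOpenHashSet {
    val threshold = maxSampleProportion * numSamples
    val discardSet = LongOpenHashSet()
    for (entry in kmerSampleCounts.long2IntEntrySet()) {
        if (entry.intValue > threshold) discardSet.add(entry.longKey)
    }
    return discardSet
}

The fastutil hash structures cap their backing array at 2^30 slots, which at the default 0.75 load factor works out to roughly the 805 million expected elements reported in the error above.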

It looks like I might be able to use a different data structure at the expense of RAM/speed but it should be able to handle more Kmers.

I will integrate these two changes shortly and will follow up here when that version of the code is ready.

Just a heads up we are also working on integrating a different alignment process using the ropebwt3 aligner. This should be ready to use in the near future.

@ClayBirkett
Author

Would it be a good idea to use repeat-masked DNA sequence? I could use one of the following:

  • 'dna' - unmasked genomic DNA sequences.
  • 'dna_rm' - masked genomic DNA. Interspersed repeats and low
    complexity regions are detected with the RepeatMasker tool and masked
    by replacing repeats with 'N's.
  • 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
    have been replaced with lowercased versions of their nucleic base

@ClayBirkett
Author

I tried the .189 release both with and without the --use-big-discard-set option, and it still crashes, though possibly in a different part of the code.

Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
at it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
at it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap.insert(Long2ObjectOpenHashMap.java:255)
at it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap.put(Long2ObjectOpenHashMap.java:263)
at it.unimi.dsi.fastutil.longs.Long2ObjectFunction.put(Long2ObjectFunction.java:124)
at it.unimi.dsi.fastutil.longs.Long2ObjectMap.put(Long2ObjectMap.java:170)
at it.unimi.dsi.fastutil.longs.Long2ObjectMap.put(Long2ObjectMap.java:41)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:360)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:199)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:108)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:38)

@jesse-hill

I think I am also running into a similar issue with phg build-kmer-index, although the error is slightly different. I either get a "Killed" message after running out of memory, or the error below. I set my -Xmx memory as high as my machine will allow (~450 GB); I am also trying to store a lot of repetitive kmers, and I'm running through my available memory quickly. I'm using phg version 2.4.28.183. I'll also try --use-big-discard-set and see if that helps.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.LinkedHashMap.newNode(LinkedHashMap.java:257)
at java.base/java.util.HashMap.putVal(HashMap.java:629)
at java.base/java.util.HashMap.put(HashMap.java:610)
at java.base/java.util.HashSet.add(HashSet.java:221)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:339)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:180)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:95)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:37)

@zrm22
Collaborator

zrm22 commented Jan 10, 2025

@ClayBirkett It looks like you are now filling up the kmer -> HapIdSet map. You could set --max-sample-proportion to something lower than the default of 0.5, but that may substantially change your mapping rates. Lowering this parameter will put more kmers into the discard set.

The other option is to adjust the kmer mask so that fewer kmers are considered. By default only min kmers that end in C are retained, but we can change this to require kmers ending in CC, which cuts the number by about 4x and may help here.

To do this set --hashMask to 15 and --hashFilter to 5. These values should work correctly and may relieve a lot of the issues you and @jesse-hill are seeing.
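To illustrate what those two values mean, here is a minimal sketch assuming a 2-bit base encoding (A=0, C=1, G=2, T=3) packed into the low bits of a Long; this is consistent with the defaults described above, but check the BuildKmerIndex source for the exact encoding, and the function name below is illustrative:

// Illustrative only. With 2 bits per base, the last base sits in the lowest 2 bits of the kmer hash.
// Default:  mask = 3  (0b11),   filter = 1 (0b01)   -> keep kmers whose last base is C.
// Proposed: mask = 15 (0b1111), filter = 5 (0b0101) -> keep kmers whose last two bases are CC,
//           roughly 4x fewer kmers than the C-only default.
fun keepKmer(kmerHash: Long, hashMask: Long, hashFilter: Long): Boolean =
    (kmerHash and hashMask) == hashFilter

On the command line that would look something like the following (other required arguments omitted; see the PHGv2 docs for the full invocation):

phg build-kmer-index --hashMask 15 --hashFilter 5 ...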

We are also close to opening up an alternative alignment approach using the ropebwt3 tool. This should work fairly efficiently with genomes of any size (they have indexed terabases). We have it initially coded up in a pull request that I will be merging shortly, and I will also need to write some basic documentation showing how to run it. We have not done full tests with it and it is still in a 'beta' stage, but it might be worth trying to see if it at least fixes the issues you are seeing.

@ClayBirkett
Author

What is the significance of a kmer ending in C or CC?

@zrm22
Collaborator

zrm22 commented Jan 10, 2025

It is mainly to reduce the total number of kmers. With 32-mers we tend to see too many distinct kmers to fit in standard primitive data structures. By only using kmers that end in C (effectively a 31-mer with a C appended), we reduce the count by 4x, which helps with RAM without having to identify and purge non-informative kmers.
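As rough arithmetic (assuming approximately uniform base composition): there are 4^32 ≈ 1.8 × 10^19 possible 32-mers, and requiring the final base to be C keeps about 1/4 of the kmers actually encountered, while requiring the final two bases to be CC keeps about 1/16, i.e. the additional 4x reduction mentioned above.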
