
[BUG]: phg build-kmer-index #266

Open
ClayBirkett opened this issue Dec 31, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@ClayBirkett

Description

The program crashed with the following error message. The scaffold it was working on is part of an assembly consisting of normal chromosomes with a few scaffolds appended to the end of the FASTA file.

[main] INFO net.maizegenetics.phgv2.utils.SeqUtils 2024-12-30 20:46:11,076: queryAgc: finished chrom scaffold_v5_415-1
Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
at it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
at it.unimi.dsi.fastutil.longs.LongOpenHashSet.add(LongOpenHashSet.java:406)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:341)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:184)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:99)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:38)

Expected behavior

No response

PHG version

phg version 2.4.33.188

ClayBirkett added the bug label Dec 31, 2024
@zrm22
Collaborator

zrm22 commented Dec 31, 2024

Hello,

This error is being thrown because we keep track of a discard set of kmers that are too repetitive. I suspect that because you are working in wheat you have a very large number of distinct kmers and are running close to the limits of the traditional data structures we use.

If a kmer is seen in more than 0.5 * numSamples samples, it is added to this set. Right now that threshold is hardcoded, but I can probably expose the parameter so you can relax it to allow for more repetitiveness.
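For reference, a minimal sketch of the thresholding being described, using illustrative names (buildDiscardSet, kmerSampleCounts, maxSampleProportion are not the actual BuildKmerIndex identifiers):

import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap
import it.unimi.dsi.fastutil.longs.LongOpenHashSet

// Illustrative only: kmer hashes seen in more than maxSampleProportion * numSamples
// samples are moved into a discard set instead of being kept in the index.
fun buildDiscardSet(
    kmerSampleCounts: Long2IntOpenHashMap, // kmer hash -> number of samples containing it
    numSamples: Int,
    maxSampleProportion: Double = 0.5      // currently hardcoded; could become a CLI option
): LongOpenHashSet {
    val threshold = maxSampleProportion * numSamples
    val discardSet = LongOpenHashSet()
    for (entry in kmerSampleCounts.long2IntEntrySet()) {
        if (entry.intValue > threshold) discardSet.add(entry.longKey)
    }
    return discardSet
}

The fastutil hash structures cap their backing array at 2^30 slots, which at the default 0.75 load factor works out to roughly the 805 million expected elements reported in the error above.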

It looks like I might be able to use a different data structure at the expense of RAM/speed but it should be able to handle more Kmers.

I will integrate these two changes shortly and will follow up here when that version of the code is ready.

Just a heads up we are also working on integrating a different alignment process using the ropebwt3 aligner. This should be ready to use in the near future.

@ClayBirkett
Author

Would it be a good idea to use repeat-masked DNA sequence? I could use one of the following:

  • 'dna' - unmasked genomic DNA sequences.
  • 'dna_rm' - masked genomic DNA. Interspersed repeats and low
    complexity regions are detected with the RepeatMasker tool and masked
    by replacing repeats with 'N's.
  • 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
    have been replaced with lowercased versions of their nucleic base

@ClayBirkett
Author

I tried the .189 release both with and without the --use-big-discard-set option, and it still crashes, though possibly in a different part of the code.

Exception in thread "main" java.lang.IllegalArgumentException: Too large (805306401 expected elements with load factor 0.75)
at it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
at it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap.insert(Long2ObjectOpenHashMap.java:255)
at it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap.put(Long2ObjectOpenHashMap.java:263)
at it.unimi.dsi.fastutil.longs.Long2ObjectFunction.put(Long2ObjectFunction.java:124)
at it.unimi.dsi.fastutil.longs.Long2ObjectMap.put(Long2ObjectMap.java:170)
at it.unimi.dsi.fastutil.longs.Long2ObjectMap.put(Long2ObjectMap.java:41)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:360)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:199)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:108)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:38)

@jesse-hill

I think I am also running into a similar issue with phg build-kmer-index, although the error is slightly different. I either get a "Killed" message after running out of memory, or the error below. I set my -Xmx memory as high as my machine will allow (~450 GB); I am also trying to store a lot of repetitive kmers, and I'm running through my available memory quickly. I'm using phg version 2.4.28.183. I'll also try --use-big-discard-set and see if that helps.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.LinkedHashMap.newNode(LinkedHashMap.java:257)
at java.base/java.util.HashMap.putVal(HashMap.java:629)
at java.base/java.util.HashMap.put(HashMap.java:610)
at java.base/java.util.HashSet.add(HashSet.java:221)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.countKmerHashesForHaplotypeSequenceSimplified(BuildKmerIndex.kt:339)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.processGraphKmers(BuildKmerIndex.kt:180)
at net.maizegenetics.phgv2.pathing.BuildKmerIndex.run(BuildKmerIndex.kt:95)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:37)

@zrm22
Collaborator

zrm22 commented Jan 10, 2025

@ClayBirkett It looks like you are now filling up the kmer -> HapIdSet map. You could set --max-sample-proportion to something lower than the default of 0.5, but that may substantially change your mapping rates. Lowering this parameter will put more kmers into the discard set.

The other option is to adjust the kmer mask so that fewer kmers are considered. By default only min kmers that end in C are retained, but we can change this to require kmers ending in CC, which cuts the number by about 4x and may help here.

To do this set --hashMask to 15 and --hashFilter to 5. These values should work correctly and may relieve a lot of the issues you and @jesse-hill are seeing.
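To illustrate what those two values mean, here is a minimal sketch assuming a 2-bit base encoding (A=0, C=1, G=2, T=3) packed into the low bits of a Long; this is consistent with the defaults described above, but check the BuildKmerIndex source for the exact encoding, and the function name below is illustrative:

// Illustrative only. With 2 bits per base, the last base sits in the lowest 2 bits of the kmer hash.
// Default:  mask = 3  (0b11),   filter = 1 (0b01)   -> keep kmers whose last base is C.
// Proposed: mask = 15 (0b1111), filter = 5 (0b0101) -> keep kmers whose last two bases are CC,
//           roughly 4x fewer kmers than the C-only default.
fun keepKmer(kmerHash: Long, hashMask: Long, hashFilter: Long): Boolean =
    (kmerHash and hashMask) == hashFilter

On the command line that would look something like the following (other required arguments omitted; see the PHGv2 docs for the full invocation):

phg build-kmer-index --hashMask 15 --hashFilter 5 ...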

We are also close to opening up an alternative alignment approach using the ropebwt3 tool. This should work fairly efficiently with genomes of any size (they have indexed terabases). We have it initially coded up in a pull request that I will be merging shortly, and I will also need to write some basic documentation showing how to run it. We have not done full tests with it and it is still in a 'beta' stage, but it might be worth trying to see if it at least fixes the issues you are seeing.

@ClayBirkett
Author

What is the significance of a kmer ending in C or CC?

@zrm22
Collaborator

zrm22 commented Jan 10, 2025

It is mainly to reduce the total number of kmers. With 32-mers we tend to see too many distinct kmers to fit in standard primitive data structures. By only using kmers that end in C (effectively a 31-mer with a C appended), we reduce the count by 4x, which helps with RAM without having to identify and purge non-informative kmers.
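As rough arithmetic (assuming approximately uniform base composition): there are 4^32 ≈ 1.8 × 10^19 possible 32-mers, and requiring the final base to be C keeps about 1/4 of the kmers actually encountered, while requiring the final two bases to be CC keeps about 1/16, i.e. the additional 4x reduction mentioned above.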
