
Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed #131

Open
minjeongjj opened this issue Jan 27, 2023 · 29 comments
Labels: bug, enhancement

Comments

@minjeongjj

Hello,

I want to run Chromap on my genome file, but indexing aborted with a core dump.

Here are the command and the log file.

Command
$chromap -i -r Combined_pseudohap.phased.filtered.0.arcs.fasta -o chromap.index -t 100 >chromap.index.log 2>chromap.index.log2

log file
Build index for the reference.
Kmer length: 17, window size: 7
Reference file: Combined_pseudohap.phased.filtered.0.arcs.fasta
Output file: chromap.index
Loaded all sequences successfully in 156.47s, number of sequences: 41577, number of bases: 19811410511.
Collecting minimizers.
Collected 4958576388 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed.

Do you have any suggestions for resolving this?

Best wishes,

minjeongjj added the bug label Jan 27, 2023
@haowenz
Owner

haowenz commented Jan 27, 2023

What is your reference genome? Why are there so many sequences, and why is the total length so long? It seems that the genome is too large for Chromap to handle.

@HMPNK

HMPNK commented Aug 26, 2023

Same problem here with a 9Gb genome. What is the limit of Chromap? Could some parameters be changed to improve this?

PS: Had no problem with a 5GB genome before ...

@mourisl
Collaborator

mourisl commented Aug 26, 2023

@HMPNK What is the longest chromosome of the 9GB genome?

@HMPNK

HMPNK commented Aug 26, 2023

@mourisl
Total: 8834612447
Count: 2159
Average: 4091992.80
Median: 73976
N00: 123690798 1
N10: 78885008 10
N20: 58174874 23
N30: 48773516 40
N40: 37269988 61
N50: 29511668 87
N60: 23255327 122
N70: 17203678 165
N80: 11905554 227
N90: 6556291 320
N100: 4315 2159

@haowenz
Owner

haowenz commented Aug 29, 2023

Did you get the same error message? Your genome is large, and I guess it has more than 2^32-1 minimizers. If that is the case, supporting such a large genome will require some code changes.

@HMPNK

HMPNK commented Aug 29, 2023

The error message was slightly different:

chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(0x7fffffff)' failed.

It collected 3,350,432,716 minimizers, which is less than the maximum of 2^32-1.

@haowenz
Owner

haowenz commented Aug 30, 2023

I just checked. The maximum number of minimizers Chromap currently supports is 2^31 - 1, not 2^32 - 1. So it would require some code changes before Chromap can support large genomes like yours.
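
To make the two limits discussed above concrete, here is a minimal Python sketch that compares the minimizer counts reported so far in this thread against 2^31 - 1 and 2^32 - 1. The helper name `fits` is hypothetical and only for illustration; it is not part of Chromap.

```python
# Limits mentioned in the thread.
INT32_MAX = 2**31 - 1   # 2,147,483,647 (same value as 0x7fffffff)
UINT32_MAX = 2**32 - 1  # 4,294,967,295

# Minimizer counts copied from the logs posted above.
reported = {
    "minjeongjj": 4_958_576_388,
    "HMPNK": 3_350_432_716,
}


def fits(num_minimizers: int, limit: int = INT32_MAX) -> bool:
    """Hypothetical helper: True if the count does not exceed the limit."""
    return num_minimizers <= limit


for user, count in reported.items():
    print(
        f"{user}: {count:,} minimizers, "
        f"fits 2^31-1: {fits(count)}, "
        f"fits 2^32-1: {fits(count, UINT32_MAX)}"
    )
```

Both counts exceed 2^31 - 1, which is why the assertion fires even for the count that would fit under 2^32 - 1.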

@HMPNK

HMPNK commented Aug 30, 2023

Could you provide those changes?
Are there other possibilities, like changing the k-mer size and window size?

@HMPNK

HMPNK commented Aug 30, 2023

I just did a test and using "-w13" worked. Increasing -w effectively reduces the number of minimizers. But I guess increasing "-w" will reduce the sensitivity of mapping? What do you think?

@mourisl
Collaborator

mourisl commented Aug 30, 2023

I just did a test and using "-w13" worked. Increasing -w efficiently reduces number of minimizers. But I guess increasing "-w" will reduce sensitivity of mapping? What do you think?

What is your read length? If your reads are long, increasing w probably won't affect the accuracy much.

@HMPNK

HMPNK commented Aug 30, 2023

It is 2 × 150bp.

@mourisl
Collaborator

mourisl commented Aug 30, 2023

I think 150bp reads should handle "-w 13" fine. Since your genome is large, you can increase "-k" a little to ensure each minimizer is unique enough in the genome, maybe -k 23 -w 17. Then you will have 3 non-overlapping windows in which to locate seeds. The default parameters were selected for 50bp scATAC-seq data.

@haowenz Is this reasonable?
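
The window arithmetic in the comment above can be sketched in a few lines of Python. This assumes the usual minimizer definition, where one window covers w consecutive k-mers and therefore spans w + k - 1 bases; whether Chromap counts windows in exactly this way is an assumption, and the function names are hypothetical.

```python
def window_span(k: int, w: int) -> int:
    """Bases covered by one window of w consecutive k-mers (assumed definition)."""
    return w + k - 1


def non_overlapping_windows(read_length: int, k: int, w: int) -> int:
    """How many full, non-overlapping windows fit into a read of this length."""
    return read_length // window_span(k, w)


# Parameter sets mentioned in this thread: (read length, k, w).
for read_length, k, w in [(150, 23, 17), (150, 17, 13), (50, 17, 7)]:
    n = non_overlapping_windows(read_length, k, w)
    print(f"read={read_length}bp k={k} w={w}: span={window_span(k, w)}bp, windows={n}")
```

Under this assumption, -k 23 -w 17 gives a 39-base span, so a 150bp read holds 3 full windows, matching the figure quoted in the comment.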

@haowenz
Owner

haowenz commented Aug 30, 2023

The fragment size can still be short, though. Currently, increasing w is probably the only way to use a large genome. It may affect sensitivity, but probably not much, as you only increase it by 3 and the k-mer size doesn't change. In the long term, we should support a larger number of minimizers.
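
As a rough guide to how far w must be raised, the sketch below uses the textbook estimate that randomly ordered minimizers have a density of about 2 / (w + 1) per base. This is an approximation and not Chromap's own calculation: real counts depend on repeats, N bases, and implementation details, so treat the result as a starting point to test. The function names are hypothetical.

```python
INT32_MAX = 2**31 - 1  # limit quoted in this thread


def expected_minimizers(num_bases: int, w: int) -> float:
    """Rough estimate assuming a minimizer density of about 2 / (w + 1)."""
    return 2 * num_bases / (w + 1)


def smallest_window(num_bases: int, limit: int = INT32_MAX) -> int:
    """Smallest w whose estimated minimizer count stays within the limit.

    Solves 2N / (w + 1) <= limit for w using integer ceiling division.
    """
    ceil_ratio = -(-2 * num_bases // limit)  # ceil(2N / limit)
    return max(1, ceil_ratio - 1)


# Base count taken from the first log in this thread.
num_bases = 19_811_410_511
print(f"estimate at w=7: {expected_minimizers(num_bases, 7):,.0f}")
print(f"smallest w under the limit (estimate): {smallest_window(num_bases)}")
```

For the first log in this thread the estimate at w=7 lands close to the 4,958,576,388 minimizers actually collected, but other genomes may deviate, so it is worth trying the suggested w and the next few values.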

@suragnair

suragnair commented Oct 1, 2023

I'm getting the same issue with the axolotl genome, which is even bigger, around 27G. Do you think it will be possible to address this any time soon?

Any suggestions for -k and -w parameters? I have R1 50bp and R2 60bp bulk ATAC-seq. Setting -w 13 still fails.

@haowenz
Owner

haowenz commented Oct 2, 2023

You may try keeping the k-mer length at 17 (-k 17) and increasing the window size to 13 (-w 13) or even larger to see if it works.

@suragnair

suragnair commented Oct 2, 2023

-w 24 seems to be the smallest window size that works for this genome, and I'm getting an alignment rate of around 50% with that. That likely suggests that 24 is too large? Will probably need to benchmark against another aligner.

@haowenz
Owner

haowenz commented Oct 3, 2023

-w 24 seems to be the smallest window size that works for this genome, and I'm getting an alignment rate of around 50% with that. That likely suggests that 24 is too large? Will probably need to benchmark against another aligner.

That's possible. Can you post more numbers here? It is also possible that the genome is repetitive and lots of multi-mappings are filtered out.

haowenz added the enhancement label Oct 3, 2023
@suragnair

suragnair commented Oct 3, 2023

Number of reads: 1204478674.
Number of mapped reads: 824244368.
Number of uniquely mapped reads: 694616494.
Number of reads have multi-mappings: 129627874.
Number of candidates: 12259974313.
Number of mappings: 824244368.
Number of uni-mappings: 694616494.
Number of multi-mappings: 129627874.

Closer to 70% with multi-mappers. Uniquely mapped read-pairs (lines in the output file) is 288M, so closer to 48%.

I'll check bowtie2. The index is taking a long time to prepare.
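
The percentages in the comment above can be reproduced with a short Python sketch. It assumes, as the comment's own arithmetic implies, that "Number of reads" counts individual reads rather than pairs; the 288M pair count is the rounded figure from the comment, not an exact value.

```python
# Counts copied from the mapping summary posted above.
num_reads = 1_204_478_674
num_mapped = 824_244_368
num_unique = 694_616_494
num_multi = 129_627_874

# Rounded figure from the comment ("288M" uniquely mapped read pairs).
approx_unique_pairs = 288_000_000


def fraction(part: int, total: int) -> float:
    """Share of `part` in `total`."""
    return part / total


mapped_rate = fraction(num_mapped, num_reads)
unique_rate = fraction(num_unique, num_reads)
# Each pair contributes two reads, so compare 2 * pairs with the read count.
pair_rate = fraction(2 * approx_unique_pairs, num_reads)

print(f"mapped (incl. multi-mappers): {mapped_rate:.1%}")
print(f"uniquely mapped reads:        {unique_rate:.1%}")
print(f"uniquely mapped read pairs:   {pair_rate:.1%}")
```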

@haowenz
Owner

haowenz commented Oct 3, 2023

Thanks for the numbers. You may try bowtie2, but building an FM-index should be even slower.

@Biscuite-wzy

Same problem here with a genome of about 9Gb when mapping Hi-C short reads; the log is below. How can I solve it?

Build index for the reference.
Kmer length: 17, window size: 7
Reference file: ref.fasta
Output file: ref.index
Loaded all sequences successfully in 194.21s, number of sequences: 5791, number of bases: 8786216834.
Collecting minimizers.
Collected 2217946008 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(0x7fffffff)' failed
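
For anyone comparing their own run, a log like the one above can be checked with a small Python sketch that extracts the base and minimizer counts and reports how far the count is over the limit. The regular expressions only target the two log lines shown in this thread, and the function name is hypothetical.

```python
import re

INT32_MAX = 2**31 - 1  # limit quoted in this thread

# Two lines copied from the log posted above.
LOG = """\
Loaded all sequences successfully in 194.21s, number of sequences: 5791, number of bases: 8786216834.
Collected 2217946008 minimizers.
"""


def parse_index_log(text: str) -> tuple[int, int]:
    """Return (number of bases, number of minimizers) found in the log text."""
    bases = re.search(r"number of bases: (\d+)", text)
    minimizers = re.search(r"Collected (\d+) minimizers", text)
    if bases is None or minimizers is None:
        raise ValueError("log does not contain the expected lines")
    return int(bases.group(1)), int(minimizers.group(1))


num_bases, num_minimizers = parse_index_log(LOG)
density = num_minimizers / num_bases
print(f"bases: {num_bases:,}")
print(f"minimizers: {num_minimizers:,} (limit {INT32_MAX:,})")
print(f"observed density: {density:.4f} minimizers per base")
print(f"over the limit by: {max(0, num_minimizers - INT32_MAX):,}")
```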

@mourisl
Collaborator

mourisl commented Nov 13, 2023

@Biscuite-wzy You can increase k-mer length (-k) and window size (-w) values a bit to see whether it works. How long is the longest chromosome in your genome?

@Biscuite-wzy

Biscuite-wzy commented Nov 13, 2023 via email

@lskfs

lskfs commented Jan 23, 2024

Is there any fix for this problem? I ran into the same issue with the axolotl genome and had to set a larger window size (-w 31) to build the genome index.

However, I suspect such a large window size is not suitable for me, because I have datasets from other species' genomes that were generated with the default parameters. So I just want to know if there is any update.

@haowenz
Owner

haowenz commented Jan 31, 2024

Besides tuning the parameters, there is no easy fix on top of the current Chromap codebase to support very large genomes. We plan to see if this is possible in the near future.

@afiyachida

Hi, was this issue fixed in the recent version of Chromap (0.2.6) ? I have genomes of 18Gb and 26Gb.

Build index for the reference.
Kmer length: 17, window size: 7
Collected 3812725164 minimizers.
Sorted minimizers.
chromap: src/index.cc:178: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers != 0 && num_minimizers <= 0x7fffffff' failed.
.command.sh: line 7: 239015 Aborted

What would be the best -k and -w for genomes of this size?

Thank you.

@mourisl
Collaborator

mourisl commented Nov 12, 2024

This has not been fixed. I think the best way to handle this is to use a standard value for k, but increase the value for -w.

@afiyachida

This has been fixed. I think the best way to handle this is to use a standard value for k, but increase the value for -w.

Hi, thank you for the reply. I am currently using version 0.2.1, so I guess updating the version would help resolve this. Can I use -w 20 for a genome of this size?

@mourisl
Collaborator

mourisl commented Nov 12, 2024

Sorry, I made a typo..it has NOT been fixed..

@afiyachida

Sorry, I made a typo..it has NOT been fixed..

Oh! Then probably using 0.2.6 won't solve it. I will try to increase value for -w and check.
