-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: Callvar throws an error when run on broadPeak file #643
Comments
So i fixed this by removing the peak names. However this leads to another issue: Traceback (most recent call last): The above exception was the direct cause of the following exception: Traceback (most recent call last): |
@LinearParadox It seems that MACS3 can't read the broadPeak file correctly. Could you share us the top few lines of your broadpeak file? |
My current file is: chr1 903718 907055 + 3338 I realize for the original error I accidentally put peak names as the first column. For context, I called a set of differential peaks in R. I'm interested in seeing if there are any variants within these peaks. So I filtered my bam file to include these peaks only, and am running it on the filtered bam and this bed file. I tested running this on the original peak calls and still get the same error. The original peak calls: chr1 17311 17589 Met.mRp.clN_peak_1 388 . 3.13389 40.8058 38.8152 |
Your original peak file looks fine to me. It's in the appropriate BED+ format. Only the first six columns are essential and they should be in order: 1. Chromosome; 2. Start; 3. End; 4. Name; 5. Score; 6. Strand. Perhaps you can try to cut the first six columns? Also, please make sure the delimiter in the file is "tab" (or "\t"). Then try again? |
Hmm... I am wrong with my previous comments. The error was found in: MACS/MACS3/Signal/ReadAlignment.pyx Line 307 in 1d6c38e
Here MACS tries to use the MD tag and CIGAR in BAM format to recover the reference sequence. MACS are supposed to see an integer but it turns out to be an empty string. I saw you mentioned that you made a BAM file for only the reads in the peaks. In fact, you don't have to do so. Please check the document here: https://macs3-project.github.io/MACS/docs/callvar.html You can use the unfiltered BAM file to call variants given the BAM file has been sorted and indexed (with a .bai file). |
SO I tried it with this, and I'm getting a different error. It's not fatal, but I'm not sure if it's impacting the results:
Edit: This also produces an empty variants file, which is somehwat unexpected. |
This error is caused due to the fact that, the size of a chunk of alignment result in BAM is decoded as larger than 2^32. It shouldn't happen normally since we only tried to decode 4bytes data into an unsigned int in the line of code pointed out by the error message: Line 574 in 1d6c38e
There must be something I didn't notice before. Could we check if this time, the BAM file you used is legit? How did you generate it? What's the output from |
Also, thanks for pointing out the error related to: MACS/MACS3/Commands/callvar_cmd.py Line 197 in 1d6c38e
Ideally, if there is any error during retrieving reads from a given peak region, MACS3 should skip the peak. And it seems this line make the whole program quit. A possible solution is to remove the peak from your peak file and try again. And I will fix it in the next release. -- fixed in 759e100 |
Let me double check if it's updated. Bam was generated through the nfcore atac seq pipeline. Weird thing is I remember running these samples through the same pipeline a while ago, and callvar running fine then. I reran them again, only difference being I used chromap as an aligner instead of bwa (default). samtools flagstat gives the following:
|
I see. This is a very important information. I just tested a bam file from |
OK. Now I got it. Here is an alignment in SAM format from
And this is the alignment from
The difference is on the 'MD' tag, with which The solution: if you have the original fasta file of the genome, you can fix these 'MD' tags by using samtools calmd -b chromap.bam genome.fa > md_fixed.bam
samtools sort -o md_fixed.sorted.bam md_fixed.bam
samtools index md_fixed.bam So, @LinearParadox, do you mind submitting a ticket to |
Hi @LinearParadox, did the |
Seems to be running fine now! Is there any way to speed it up by chance? It's been a day and I'm still waiting on it to finish. |
There are a couple of things to tweak to make the process faster. 1. decrease the number of peaks you plan to call variants; 2. use multiple CPUs with option |
Got it. The odd thing is when I pass the m option, my CPU usage never seems to go above 100% (when monitored with top). There seems to be no difference vs if I just didn't pass the option. |
When running on a broadpeak file, callvar does the following:
File "MACS3/Signal/RACollection.pyx", line 358, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/RACollection.pyx", line 382, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 376, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 408, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 307, in MACS3.Signal.ReadAlignment.ReadAlignment.get_REFSEQ
ValueError: invalid literal for int() with base 10: b''
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/miniconda3/envs/MACS3/bin/macs3", line 1028, in
main()
File "/envs/MACS3/bin/macs3", line 105, in main
run( args )
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 235, in run
results = mapresults.get(timeout=window_size*300)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
ValueError: invalid literal for int() with base 10: b''
The text was updated successfully, but these errors were encountered: