Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam_stats fails to give informative error on BAMs with no RG lines/tags #66

Open
leovam opened this issue Jul 22, 2021 · 8 comments
Open
Assignees
Labels

Comments

@leovam
Copy link

leovam commented Jul 22, 2021

Hi,
I have been trying to use cgpwgs to call the variants on my own BAM files. I saw it said the BAM files are expected to have been mapped using cgpmap. We did not use the cgpmap to get the BAM files but got them with the common routine: 1) align the pair-end fastq files to hg38 using BWA MEM to get the SAM files, 2) convert SAM to BAM and 3) sort and index the BAM files with SAMTOOLS.

then we used bam_stats in the dockstore-cgpwgs to get the bas file for the sorted BAM files. the script was like:

docker run -ti -rm
...some setting \
quay.io/wtsicgp/dockstore-cgpwgs:2.1.1 \
bam_stats \
-i /var/spool/data/tumorB.bam \
-o /var/spool/data/tumorB.bas \
-@ 4

however, the program finsied immediately after hittign "enter" and no bas files produced at all and there is no error message so we did not know what went wrong. We also test the script on the example BAM files and it will give us the corresponding bas files in about 1 minute.

We wonder:

  1. what would be wrong in the case aforementioned and how to fix it
  2. Do we have to use the cgpmap to do the alignment? What would be different if we just align the reads by following the common routine aforementioned or by using other aligners such as BOWTIE2? We asked this because we have long read fastq files that need other tools for accurate alignment (e.g. minimap2)
  3. if we have to use other aligners to get the BAM files, is it not recommended to use cgpwgs? if not, do you have any pipeline for recommendations that would give the user more freedom?

Thanks!

@keiranmraine
Copy link
Contributor

Hi,

  1. bam_stats should work on any well formed BAM file. It's possible it may fail if the reads have RG tags that aren't represented in the header but I would expect to see an error. Are you able to provide the header and first couple of records of your BAM file? This command will create a new BAM with just 10 reads while including the original header:
(samtools view -H /var/spool/data/tumorB.bam ; samtools view /var/spool/data/tumorB.bam | head -n 10) | samtools view -b - > mini.bam
  1. cgpmap mapping is not required to run cgpwgs, however that is what it is intended for, so you would need to evaluate the results are acceptable.
  2. cgpwgs is intended to be a replication of our internal methodology, hence the indication that cgpmap is the preferred input. We haven't performed extensive testing on other mapping pipelines.

@keiranmraine
Copy link
Contributor

Please run the command I provided to generate the file and attach it to the issue (drag'n drop), the long comment has been deleted.

@cancerit cancerit deleted a comment from leovam Jul 23, 2021
@leovam
Copy link
Author

leovam commented Jul 26, 2021

Github cannot support the BAM file as an attachment, is there another way I can share the file?

@keiranmraine
Copy link
Contributor

I have reviewed your file (via email). Your BAM file is invalid using PHREAD+64 encoding for the quality scores. Note all your reads have a run of lllll as the quality score. This is only possible if phread+64 has been used. Many tools do not validate this incorrect use of the format.

  1. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A
    base quality is the phred-scaled base error probability which equals −10 log10 Pr{base is wrong}. This
    field can be a ‘’ when quality is not stored. If not a ‘’, SEQ must not be a ‘*’ and the length of the
    quality string ought to equal the length of SEQ.

You can find more information on phread encoding here:

https://en.wikipedia.org/wiki/FASTQ_format#Encoding

@keiranmraine keiranmraine reopened this Aug 4, 2021
@keiranmraine keiranmraine transferred this issue from cancerit/dockstore-cgpwgs Aug 4, 2021
@keiranmraine
Copy link
Contributor

Correspondence with new BAM

Thanks for your response to the issue on Github. I have some follow-up questions but that thread was closed (reopened).

The previous data was generated by simulation and we accidentally make the quality score using PHRED+64 encode. While we are now trying to set up the quality score using PHRED+33, we did some other tests on the data from a commercial sequencing company. I believe the quality score in the real data is based on PHRED+33 but still got the same issue: the program finished immediately after hitting "enter" and no .bas files produced at all and there is no error message.

I am attaching the new mini.bam as following, which was generated using the code you provided in the original post. Please note, the real data is targeted sequencing on a list of specific genes rather than WGS, but I believe bam_stat just generates some statistics file from the BAM file, no matter what kind of sequencing data they are. Please correct me if I am wrong and provide some hints about what might be wrong. Thanks!

Best

@keiranmraine
Copy link
Contributor

This appears to be related to a lack of readgroup header (and aux tag). The code is failing when attempting to handle the absence of the sample information.

$ valgrind --track-origins=yes bam_stats -i ~/mini.bam -o mini.bam
==3134060== Memcheck, a memory error detector
==3134060== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3134060== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==3134060== Command: bam_stats -i /home/kr2/mini.bam -o mini.bam
==3134060== 
==3134060== Use of uninitialised value of size 8
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  Uninitialised value was created by a heap allocation
==3134060==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3134060==    by 0x111D91: bam_access_parse_header (bam_access.c:108)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060== 
==3134060== Invalid write of size 8
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==3134060== 
==3134060== 
==3134060== Process terminating with default action of signal 11 (SIGSEGV)
==3134060==  Access not within mapped region at address 0x0
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  If you believe this happened as a result of a stack
==3134060==  overflow in your program's main thread (unlikely but
==3134060==  possible), you can try to increase the size of the
==3134060==  main thread stack using the --main-stacksize= flag.
==3134060==  The main thread stack size used in this run was 8388608.
==3134060== 
==3134060== HEAP SUMMARY:
==3134060==     in use at exit: 142,854 bytes in 115 blocks
==3134060==   total heap usage: 119 allocs, 4 frees, 190,778 bytes allocated
==3134060== 
==3134060== LEAK SUMMARY:
==3134060==    definitely lost: 10 bytes in 2 blocks
==3134060==    indirectly lost: 0 bytes in 0 blocks
==3134060==      possibly lost: 0 bytes in 0 blocks
==3134060==    still reachable: 142,844 bytes in 113 blocks
==3134060==         suppressed: 0 bytes in 0 blocks
==3134060== Rerun with --leak-check=full to see details of leaked memory
==3134060== 
==3134060== For lists of detected and suppressed errors, rerun with: -s
==3134060== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

PCAP-core/c/bam_access.c

Lines 108 to 110 in 79f22b6

check_mem(groups);
groups[0]->id = strdup(".");
groups[0]->sample = strdup(".");

@davidrajones can you take a look please? Bam attached with artificial .zip extension:

mini.bam.zip

@leovam you can work around this by defining a read group header and tagging the records with the ID. Please note that the cgpmap process is just a CWL wrapper around this repository (pcap-core). The methodology you use for mapping can be achieved using bwa_mem.pl which will both generate @RG headers and the bam_stats output automatically.

@leovam
Copy link
Author

leovam commented Aug 5, 2021

Thanks for the efforts to provide such information. I have a general question. Is @RG essential for those tools? If so, how? I also tried some GATK pipeline and it looks like @RG has to be there to pass some data check.

I use BWA-MEM directly to do the alignment instead of using bwa-mem.pl in the docker, why it could not generate @RG header? The reason we did not the cgpmap is that we have some long-read data where bwa-mem might not work well

"by defining a read group header and tagging the records with the ID" means that we can actually work with the current BAM file and add @RG to it? If so, do you have any recommended tool to do this? I apologize that I have so many questions that may be trivial since I am new to this field, and thanks for your help and time in advance.

@keiranmraine
Copy link
Contributor

The @RG header can be important for duplicate removal tools. The main utility is to aid identification of the source of reads, such as which reads for a single merged sample file come from different sequencing runs (library, instrument, lane).

You can pass a RG entry into many mapping tools, but you can also apply it on a stream using samtools addreplacerg, see command line help but I'd expect something along the lines:

some_aligner ... | samtools addreplacerg ... -u- | samtools sort -O BAM ... - > sorted_RG.bam

Each lane of sequencing should be processed independently and then merged after the RG+sorting has been applied

P.S. I noticed in your example file that you have the core reference sequences and the targeted genes declared in the header. This is not something I would expect to see (certainly in short-read) and may result in your reads having very low mapping scores as they may be considered non-unique mappings.

@keiranmraine keiranmraine changed the title cannot run bam_stat on BAM files produced by BWA MEM bam_stats fails to give informative error on BAMs with no RG lines/tags Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants