bam_stats fails to give informative error on BAMs with no RG lines/tags #66

leovam · 2021-07-22T18:41:30Z

Hi,
I have been trying to use cgpwgs to call the variants on my own BAM files. I saw it said the BAM files are expected to have been mapped using cgpmap. We did not use the cgpmap to get the BAM files but got them with the common routine: 1) align the pair-end fastq files to hg38 using BWA MEM to get the SAM files, 2) convert SAM to BAM and 3) sort and index the BAM files with SAMTOOLS.

then we used bam_stats in the dockstore-cgpwgs to get the bas file for the sorted BAM files. the script was like:

docker run -ti -rm
...some setting \
quay.io/wtsicgp/dockstore-cgpwgs:2.1.1 \
bam_stats \
-i /var/spool/data/tumorB.bam \
-o /var/spool/data/tumorB.bas \
-@ 4

however, the program finsied immediately after hittign "enter" and no bas files produced at all and there is no error message so we did not know what went wrong. We also test the script on the example BAM files and it will give us the corresponding bas files in about 1 minute.

We wonder:

what would be wrong in the case aforementioned and how to fix it
Do we have to use the cgpmap to do the alignment? What would be different if we just align the reads by following the common routine aforementioned or by using other aligners such as BOWTIE2? We asked this because we have long read fastq files that need other tools for accurate alignment (e.g. minimap2)
if we have to use other aligners to get the BAM files, is it not recommended to use cgpwgs? if not, do you have any pipeline for recommendations that would give the user more freedom?

Thanks!

The text was updated successfully, but these errors were encountered:

keiranmraine · 2021-07-23T08:07:11Z

Hi,

bam_stats should work on any well formed BAM file. It's possible it may fail if the reads have RG tags that aren't represented in the header but I would expect to see an error. Are you able to provide the header and first couple of records of your BAM file? This command will create a new BAM with just 10 reads while including the original header:

(samtools view -H /var/spool/data/tumorB.bam ; samtools view /var/spool/data/tumorB.bam | head -n 10) | samtools view -b - > mini.bam

cgpmap mapping is not required to run cgpwgs, however that is what it is intended for, so you would need to evaluate the results are acceptable.
cgpwgs is intended to be a replication of our internal methodology, hence the indication that cgpmap is the preferred input. We haven't performed extensive testing on other mapping pipelines.

keiranmraine · 2021-07-23T20:25:14Z

Please run the command I provided to generate the file and attach it to the issue (drag'n drop), the long comment has been deleted.

leovam · 2021-07-26T16:18:51Z

Github cannot support the BAM file as an attachment, is there another way I can share the file?

keiranmraine · 2021-08-03T10:27:14Z

I have reviewed your file (via email). Your BAM file is invalid using PHREAD+64 encoding for the quality scores. Note all your reads have a run of lllll as the quality score. This is only possible if phread+64 has been used. Many tools do not validate this incorrect use of the format.

QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A
base quality is the phred-scaled base error probability which equals −10 log10 Pr{base is wrong}. This
field can be a ‘’ when quality is not stored. If not a ‘’, SEQ must not be a ‘*’ and the length of the
quality string ought to equal the length of SEQ.

You can find more information on phread encoding here:

https://en.wikipedia.org/wiki/FASTQ_format#Encoding

keiranmraine · 2021-08-04T08:14:13Z

Correspondence with new BAM

Thanks for your response to the issue on Github. I have some follow-up questions but that thread was closed (reopened).

The previous data was generated by simulation and we accidentally make the quality score using PHRED+64 encode. While we are now trying to set up the quality score using PHRED+33, we did some other tests on the data from a commercial sequencing company. I believe the quality score in the real data is based on PHRED+33 but still got the same issue: the program finished immediately after hitting "enter" and no .bas files produced at all and there is no error message.

I am attaching the new mini.bam as following, which was generated using the code you provided in the original post. Please note, the real data is targeted sequencing on a list of specific genes rather than WGS, but I believe bam_stat just generates some statistics file from the BAM file, no matter what kind of sequencing data they are. Please correct me if I am wrong and provide some hints about what might be wrong. Thanks!

Best

keiranmraine · 2021-08-04T08:49:33Z

This appears to be related to a lack of readgroup header (and aux tag). The code is failing when attempting to handle the absence of the sample information.

$ valgrind --track-origins=yes bam_stats -i ~/mini.bam -o mini.bam
==3134060== Memcheck, a memory error detector
==3134060== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3134060== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==3134060== Command: bam_stats -i /home/kr2/mini.bam -o mini.bam
==3134060== 
==3134060== Use of uninitialised value of size 8
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  Uninitialised value was created by a heap allocation
==3134060==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3134060==    by 0x111D91: bam_access_parse_header (bam_access.c:108)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060== 
==3134060== Invalid write of size 8
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==3134060== 
==3134060== 
==3134060== Process terminating with default action of signal 11 (SIGSEGV)
==3134060==  Access not within mapped region at address 0x0
==3134060==    at 0x111DB4: bam_access_parse_header (bam_access.c:110)
==3134060==    by 0x11150C: main (bam_stats.c:188)
==3134060==  If you believe this happened as a result of a stack
==3134060==  overflow in your program's main thread (unlikely but
==3134060==  possible), you can try to increase the size of the
==3134060==  main thread stack using the --main-stacksize= flag.
==3134060==  The main thread stack size used in this run was 8388608.
==3134060== 
==3134060== HEAP SUMMARY:
==3134060==     in use at exit: 142,854 bytes in 115 blocks
==3134060==   total heap usage: 119 allocs, 4 frees, 190,778 bytes allocated
==3134060== 
==3134060== LEAK SUMMARY:
==3134060==    definitely lost: 10 bytes in 2 blocks
==3134060==    indirectly lost: 0 bytes in 0 blocks
==3134060==      possibly lost: 0 bytes in 0 blocks
==3134060==    still reachable: 142,844 bytes in 113 blocks
==3134060==         suppressed: 0 bytes in 0 blocks
==3134060== Rerun with --leak-check=full to see details of leaked memory
==3134060== 
==3134060== For lists of detected and suppressed errors, rerun with: -s
==3134060== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

PCAP-core/c/bam_access.c

Lines 108 to 110 in 79f22b6

    
           check_mem(groups); 
        
           groups[0]->id = strdup("."); 
        
           groups[0]->sample = strdup(".");

@davidrajones can you take a look please? Bam attached with artificial .zip extension:

mini.bam.zip

@leovam you can work around this by defining a read group header and tagging the records with the ID. Please note that the cgpmap process is just a CWL wrapper around this repository (pcap-core). The methodology you use for mapping can be achieved using bwa_mem.pl which will both generate @RG headers and the bam_stats output automatically.

leovam · 2021-08-05T06:16:11Z

Thanks for the efforts to provide such information. I have a general question. Is @RG essential for those tools? If so, how? I also tried some GATK pipeline and it looks like @RG has to be there to pass some data check.

I use BWA-MEM directly to do the alignment instead of using bwa-mem.pl in the docker, why it could not generate @RG header? The reason we did not the cgpmap is that we have some long-read data where bwa-mem might not work well

"by defining a read group header and tagging the records with the ID" means that we can actually work with the current BAM file and add @RG to it? If so, do you have any recommended tool to do this? I apologize that I have so many questions that may be trivial since I am new to this field, and thanks for your help and time in advance.

keiranmraine · 2021-08-05T08:16:12Z

The @RG header can be important for duplicate removal tools. The main utility is to aid identification of the source of reads, such as which reads for a single merged sample file come from different sequencing runs (library, instrument, lane).

You can pass a RG entry into many mapping tools, but you can also apply it on a stream using samtools addreplacerg, see command line help but I'd expect something along the lines:

some_aligner ... | samtools addreplacerg ... -u- | samtools sort -O BAM ... - > sorted_RG.bam

Each lane of sequencing should be processed independently and then merged after the RG+sorting has been applied

P.S. I noticed in your example file that you have the core reference sequences and the targeted genes declared in the header. This is not something I would expect to see (certainly in short-read) and may result in your reads having very low mapping scores as they may be considered non-unique mappings.

cancerit deleted a comment from leovam Jul 23, 2021

keiranmraine closed this as completed Aug 3, 2021

keiranmraine reopened this Aug 4, 2021

keiranmraine transferred this issue from cancerit/dockstore-cgpwgs Aug 4, 2021

keiranmraine added the bug label Aug 4, 2021

keiranmraine assigned davidrajones Aug 4, 2021

keiranmraine changed the title ~~cannot run bam_stat on BAM files produced by BWA MEM~~ bam_stats fails to give informative error on BAMs with no RG lines/tags Sep 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bam_stats fails to give informative error on BAMs with no RG lines/tags #66

bam_stats fails to give informative error on BAMs with no RG lines/tags #66

leovam commented Jul 22, 2021

keiranmraine commented Jul 23, 2021

keiranmraine commented Jul 23, 2021

leovam commented Jul 26, 2021

keiranmraine commented Aug 3, 2021

keiranmraine commented Aug 4, 2021

keiranmraine commented Aug 4, 2021

leovam commented Aug 5, 2021

keiranmraine commented Aug 5, 2021

bam_stats fails to give informative error on BAMs with no RG lines/tags #66

bam_stats fails to give informative error on BAMs with no RG lines/tags #66

Comments

leovam commented Jul 22, 2021

keiranmraine commented Jul 23, 2021

keiranmraine commented Jul 23, 2021

leovam commented Jul 26, 2021

keiranmraine commented Aug 3, 2021

keiranmraine commented Aug 4, 2021

keiranmraine commented Aug 4, 2021

leovam commented Aug 5, 2021

keiranmraine commented Aug 5, 2021