Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate reads in BAM #30

Open
JenniferShelton opened this issue Sep 2, 2020 · 2 comments
Open

Duplicate reads in BAM #30

JenniferShelton opened this issue Sep 2, 2020 · 2 comments

Comments

@JenniferShelton
Copy link

Hi,

This tool is very useful in my workflow but I have a question about the preprocessing step.

The chr6 coordinates overlap causing the final alignment to have duplicate primary alignments for the same read. The number of primary alignments matches the number of times the filter step used "samtools view chr6:START1-END1...chr6:STARTn-ENDn" .

For example, if I use this process to filter reads it duplicats reads in the first samtools view and then duplicats them a second time for the second samtools extraction. These four reads then have primary alignments to different contigs in the panel (each of which has a flag of 83 or 163)
https://github.com/Kingsford-Group/kourami/blob/master/scripts/alignAndExtract_hs38DH.sh

A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*266:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*266:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*379:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*379:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*351:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*351:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*20:01:02	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*20:01:02	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76

The number of alignments is arbitrarily large depending on how many times the read was included in the FASTQ. They can also align multiple times to the same contig. Should I actually be creating a BAM with all possible alignments or with one primary alignment only?

@heewookl
Copy link
Contributor

BAM with one primary alignment should work fine as one primary alignment will be projected on to our graph representation of alleles. My guess is that the original bam file that you have duplicated entries for this read.

@JenniferShelton
Copy link
Author

JenniferShelton commented Mar 29, 2023

Thanks @heewookl, the overlapping regions will pull the same primary read once for each of the regions in the overlap (leading to multiple primary alignments in the output). I have a workflow that corrects this at https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants