Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.1.5 kicking out one of pair of discordant reads ; v1.1.1 not showing same behaviour #664

Open
alexander-e-f-smith opened this issue Oct 9, 2024 · 12 comments

Comments

@alexander-e-f-smith
Copy link

Describe the bug
Following upgrade of UMI-tools to v1.1.5 (still testing under commit bcce5e6 for python 1.12 compatibility) in our fusion calling pipeline, I have seen specific reads being kicked out of the dedup process where they weren't under version 1.1.1. These reads are discordant /chimeric type reads where each of a pair is aligned to a different chromosome. As this is a fusion calling pipeline, such reads are important. Note also, I remove any unaligned/improper pairs prior to the dedup process, as I know umi-tools doesn't handle these well/in way I want it to.

To Reproduce
Exampled below are a pair of reads in a cleaned bam file going into dedup process and grep of same read ID following dedup process (dedup.bam)

for v1.1.1
Input into umi-tools:
samtools view sample1.cleaned.bam | grep ":3:23612:8962:11359_GCAACATGTG"
read:3:23612:8962:11359_GCAACATGTG    145     chr9    130854936       255     126M    chr22   23289524        0       GGCCTGTGTCCCGCAATGCCGCTGAGTATCTGCTGAGCAGCGGGATCAATGGCAGCTTCTTGGTGCGTGAGAGTGAGAGCAGTCCTGGCCAGAGGTCCATCTCGCTGAGATACGAAGGGAGGGTGT   E<A/EAEAA</A6AAAA<<6/AA<<AEEEEAAEEAAA<<A<<<AE<A/EEEE<EEEEEEEEEEEE<AE<EEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEEAAEEEEEEEEEEEEEAEEA/EEEE  RG:Z:sample1   NH:i:1  HI:i:1   NM:i:0  nM:i:0  AS:i:124
read:3:23612:8962:11359_GCAACATGTG    97      chr22   23289524        255     98M717N50M      chr9    130854936       0       AAGCTTCTCCCTGACATCCGTGGAGCTGCAGATGCTGACCAACTCGTGTGTGAAACTCCAGACTGTCCACAGCATTCCGCTGACCATCAATAAGGAAGATGATGAGTCTCCGGGGCTCTATGGGTTTCTGAATGTCATCGTCCACTCA     AAAAAEEEEEAEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAAEEEAE<EEEEEEEEEEEEEEEEEEEEEAAEA<<AEEEAEEEEEEAEAAEEEEAEEEEEEA<EAAEAEAE     RG:Z:sample1   NH:i:1  HI:i:1  NM:i:0  nM:i:0  AS:i:148

Umi-tools dedup output:
samtools view sample1.dedupUT.bam | grep ":3:23612:8962:11359_GCAACATGTG"
read:3:23612:8962:11359_GCAACATGTG    97      chr22   23289524        255     98M717N50M      chr9    130854936       0       AAGCTTCTCCCTGACATCCGTGGAGCTGCAGATGCTGACCAACTCGTGTGTGAAACTCCAGACTGTCCACAGCATTCCGCTGACCATCAATAAGGAAGATGATGAGTCTCCGGGGCTCTATGGGTTTCTGAATGTCATCGTCCACTCA     AAAAAEEEEEAEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAAEEEAE<EEEEEEEEEEEEEEEEEEEEEAAEA<<AEEEAEEEEEEAEAAEEEEAEEEEEEA<EAAEAEAE     RG:Z:sample1   NH:i:1  HI:i:1  NM:i:0  nM:i:0  AS:i:148
read:3:23612:8962:11359_GCAACATGTG    145     chr9    130854936       255     126M    chr22   23289524        0       GGCCTGTGTCCCGCAATGCCGCTGAGTATCTGCTGAGCAGCGGGATCAATGGCAGCTTCTTGGTGCGTGAGAGTGAGAGCAGTCCTGGCCAGAGGTCCATCTCGCTGAGATACGAAGGGAGGGTGT   E<A/EAEAA</A6AAAA<<6/AA<<AEEEEAAEEAAA<<A<<<AE<A/EEEE<EEEEEEEEEEEE<AE<EEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEEAAEEEEEEEEEEEEEAEEA/EEEE  RG:Z:sample1   NH:i:1  HI:i:1   NM:i:0  nM:i:0  AS:i:124
for v1.1.5
same input as above

Umitools output (only one of pair output):
samtools view sample1rpt.dedupUT.bam | grep ":3:23612:8962:11359_GCAACATGTG"
read:3:23612:8962:11359_GCAACATGTG    97      chr22   23289524        255     98M717N50M      chr9    130854936       0       AAGCTTCTCCCTGACATCCGTGGAGCTGCAGATGCTGACCAACTCGTGTGTGAAACTCCAGACTGTCCACAGCATTCCGCTGACCATCAATAAGGAAGATGATGAGTCTCCGGGGCTCTATGGGTTTCTGAATGTCATCGTCCACTCA     AAAAAEEEEEAEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAAEEEAE<EEEEEEEEEEEEEEEEEEEEEAAEA<<AEEEAEEEEEEAEAAEEEEAEEEEEEA<EAAEAEAE     RG:Z:sample1rpt   NH:i:1  HI:i:1  NM:i:0  nM:i:0  AS:i:148

Umi-tools options run here are the same for both version/examples seen above including:

--method=directional --unmapped-reads=use --chimeric-pairs=use --unpaired-reads=use --no-sort --spliced-is-unique -- 
 multimapping-detection-method=NH --paired --soft-clip-threshold=10 --mapping-quality=0 --log2stderr --random-seed=100

Expected behaviour
Output both reads of the pair. I can see from downstream pipeline stage of converting bam to fastq, there are now significant numbers of single reads (as per samtools fastq -s option) following v1.1.5 dedup - not see before under v1.1.1. These will include such singletons as seen above

Environment

  • UMI-tools version 1.1.1(tag): linux buster, py3.7 in docker
  • UMI-tools version1.1.5 commit bcce5e6: bullseye py3.12 in docker
  • How was UMI-tools installed: <
    download/clone then python3 -m pip install -r requirements.txt && python3 setup.py install

Additional context
Add any other context about the problem here. Was UMI-tools being run within another pipeline, etc.

@alexander-e-f-smith
Copy link
Author

I can see you have now have 1.1.6 tagged release, which I'll now use/test to avoid any additional variables - can see some documentation clarification regarding unpaired/unaligned reads .. no obvious code/functionality change relating to this issue however

@IanSudbery
Copy link
Member

The only conceivable relevant change I can see between 1.1.1 and 1.1.6 is that we changed the pairwriter to hold two separate file handles to the BAM file open, rather than multiple iterators referencing the same BAM file. See PR #543. I really don't think this should be an issue though. Can you check the log file for two things:

  1. The the process did indeed finish - that there is a # processed finished in XXX seconds at the end of the log.
  2. If the number of mates not found has changed. This should be at the end of the log, and will tell us if UMI-tools is looking for the mate, but not finding it, or if it isn't even looking.

The other possibility, I guess, is that pysam has changed something in the way it handles BAM files. Can you report the pysam versions in each case?

@alexander-e-f-smith
Copy link
Author

alexander-e-f-smith commented Oct 11, 2024

Hiya. Just ran comparison on much smaller dataset(is what I have at hand locally) and can see v1.1,6 is 'not finding' some mates where as v1.1.1 is not having same issue:

UMI-tools version: 1.1.6:

0-11 16:04:38,612 INFO Searching for mates for 103 unmatched alignments
2024-10-11 16:04:38,613 INFO 103 mates never found
2024-10-11 16:05:04,919 INFO Reads: Input Reads: 923723, Read pairs: 923723, Chimeric read pair: 379
2024-10-11 16:05:04,920 INFO Number of reads out: 848881
2024-10-11 16:05:04,920 INFO Total number of positions deduplicated: 353342
2024-10-11 16:05:04,920 INFO Mean number of unique UMIs per position: 2.43
2024-10-11 16:05:04,920 INFO Max. number of unique UMIs per position: 1286
# job finished in 106 seconds at Fri Oct 11 16:05:04 2024 -- 110.84  0.70  0.00  0.00 -- 9d733e96-b6a7-4567-b914-2c9b7fccd4eb

UMI-tools version: 1.1.1

2024-10-11 16:00:06,831 INFO Searching for mates for 103 unmatched alignments
2024-10-11 16:00:11,294 INFO 0 mates never found
2024-10-11 16:00:20,442 INFO Reads: Input Reads: 923723, Read pairs: 923723, Chimeric read pair: 379
2024-10-11 16:00:20,442 INFO Number of reads out: 848881
2024-10-11 16:00:20,442 INFO Total number of positions deduplicated: 353342
2024-10-11 16:00:20,442 INFO Mean number of unique UMIs per position: 2.43
2024-10-11 16:00:20,442 INFO Max. number of unique UMIs per position: 1286
# job finished in 85 seconds at Fri Oct 11 16:00:20 2024 -- 87.49  3.06  0.01  0.03 -- 9e15d38e-43b7-42cc-832f-a0a344e08c65

Note, I have run your updated code for pytest on the 1.1.6 version installation above which completes with no errors!
Pysam for 1.1.6 : pysam==0.22.1 under python 3.12
Pysam for 1.1.1 : pysam==0.16.0.1 under python 3.7

@IanSudbery
Copy link
Member

I don't suppose there is any way you could either:

  1. Run umi_tools 1.1.6 with pysam 0.16.01 or umi_tools 1.1.1 with pysam 0.22.1

OR

  1. Let me have a copy of a file that shows this problem so I can try to debug here?

@IanSudbery
Copy link
Member

However, the fact that the final search for mates in 1.1.6 takes less than a millisecond doesn't fill me with hope.

@alexander-e-f-smith
Copy link
Author

alexander-e-f-smith commented Oct 11, 2024

Yeh managed to install 1.1.1 with pysam 0.22.1 (other way around not quickly working for me...compatibility errors...):
Shows no issue, so seemingly not pysam alone:

2024-10-11 16:59:34,542 INFO 0 mates never found
2024-10-11 16:59:43,627 INFO Reads: Input Reads: 923723, Read pairs: 923723, Chimeric read pair: 379
2024-10-11 16:59:43,627 INFO Number of reads out: 848881
2024-10-11 16:59:43,627 INFO Total number of positions deduplicated: 353342
2024-10-11 16:59:43,628 INFO Mean number of unique UMIs per position: 2.43
2024-10-11 16:59:43,628 INFO Max. number of unique UMIs per position: 1286
# job finished in 84 seconds at Fri Oct 11 16:59:43 2024 -- 86.05  2.94  0.01  0.02 -- 159e8154-1cb9-49e1-ada9-93df25d122e8
real 87.33

May be able to sent some control data . May need to wait until next week now.

@IanSudbery
Copy link
Member

Are you comfortable installing a custom version?

If so could you pull the branch at:
https://github.com/CGATOxford/UMI-tools/tree/%7BIS%7D_chimeric_debugging

install into the relevant environment (python setup.py install) and try that?

I've reverted the only line that I think could possibly be connected. Its not a long term fix, as this will slow umi_tools down several hundred fold in some situations, but at least we might be able to pin point the problem.

@alexander-e-f-smith
Copy link
Author

Hi Ian, sorry for late response. Yes I can have a look at as soon as I can.

@alexander-e-f-smith
Copy link
Author

Bssed on UMI-tools out logs only (i've not had time to look at the actual reads in question), this seems to have fixed the issue ie under the branch above (commit 8eb8e93d7d981a121d857ad411f16cf8ef7ce0c1) we see:

2024-10-15 14:25:30,024 INFO Searching for mates for 103 unmatched alignments
2024-10-15 14:25:36,808 INFO 0 mates never found

@IanSudbery
Copy link
Member

Okay, how about the lastest commit on that branch?

@alexander-e-f-smith
Copy link
Author

Hi Ian. Apologoes for the pause in following this up. I'm back working on the pipeline which relates to this issue now. I have installed the latest commit on the bug fix brach in question (commit 0a7c536) and can see from my data the issue seems to be fixed here as well :

2024-11-14 10:39:10,494 INFO Searching for mates for 103 unmatched alignments
2024-11-14 10:39:16,098 INFO 0 mates never found
2024-11-14 10:39:41,912 INFO Reads: Input Reads: 923723, Read pairs: 923723, Chimeric read pair: 379
2024-11-14 10:39:41,912 INFO Number of reads out: 848881
2024-11-14 10:39:41,912 INFO Total number of positions deduplicated: 353342
2024-11-14 10:39:41,912 INFO Mean number of unique UMIs per position: 2.43
2024-11-14 10:39:41,912 INFO Max. number of unique UMIs per position: 1286

As such do you think this will be merged into main/released?

@IanSudbery
Copy link
Member

I need to do some benchmarking first, but I'm cautiously optimistic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants