group method: not all UMI-containing reads assigned UG tag #635

epiliper · 2024-03-20T23:37:39Z

Hi! After running group on a sorted bamfile of roughly 3.4 million reads, I noticed that only half (~1.6 million reads) were actually given UG tags.

I checked this by running samtools view -d UG on the grouped file and counting the number of reads meeting these filter conditions.
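To see which reads are missing the tag, the same count can be split by mate. The following is a minimal sketch that works on SAM text lines (for example, the output of `samtools view`); the function name `count_ug_by_mate` and the two example records are made up for illustration and are not part of UMI-tools or samtools.

```python
from collections import Counter


def count_ug_by_mate(sam_lines):
    """Tally SAM records by mate (read1/read2/unpaired) and presence of a UG tag."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x40:      # SAM FLAG bit: first segment in the template
            mate = "read1"
        elif flag & 0x80:    # SAM FLAG bit: last segment in the template
            mate = "read2"
        else:
            mate = "unpaired"
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
        has_ug = any(f.startswith("UG:") for f in fields[11:])
        counts[(mate, "tagged" if has_ug else "untagged")] += 1
    return counts


# Two made-up records for one pair: only read1 carries a UG tag.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "pair1:ACGT\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tFFFF\tUG:i:0\tBX:Z:ACGT",
    "pair1:ACGT\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tFFFF",
]
counts = count_ug_by_mate(example)
for key, n in sorted(counts.items()):
    print(key, n)
```

If every untagged read turns out to be a read2, the missing tags follow the mate rather than the UMI.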

All reads in the bamfile I used should contain UMIs. Is this expected behavior, and if so, what might cause some reads to not get assigned to a read group?

Thanks in advance for your patience; I'd really appreciate any help/explanation.

I'm a new user so apologies if I'm missing the obvious.

TomSmithCGAT · 2024-03-21T10:49:29Z

Hi @epiliper - Could you please post the umi_tools group command used? Cheers

epiliper · 2024-03-21T11:07:18Z

Command used:

umi_tools group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=":" --paired -S UMI3-RSV90-10_S2_umi.bam

Here is the QNAME of a typical read in my input bamfile:

M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT

Where TGTCAGGCTAT is the UMI.
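For reference, here is a minimal sketch of how a UMI can be pulled from a read name like this one, assuming the UMI is whatever follows the last separator. The helper `umi_from_read_id` is hypothetical and only illustrates the idea; it is not the UMI-tools implementation.

```python
def umi_from_read_id(qname, sep=":"):
    """Return the text after the last separator in a read name.

    Assumption: the UMI is the final separator-delimited field.
    If the separator is absent, the whole name is returned.
    """
    return qname.rsplit(sep, 1)[-1]


qname = "M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT"
umi = umi_from_read_id(qname)
print(umi)  # TGTCAGGCTAT
```

Because the Illumina read name itself uses ":" between fields, only the last field is taken here.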

When I run the above command, it runs with these options:

UMI-tools version: 1.1.4
output generated by group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=: --paired -S UMI3-RSV90-10_S2_umi.bam
job started at Thu Mar 21 04:04:31 2024 on MacBeth.users.gowaveg.com -- fb9c6e30-e5bd-4b2f-9b5f-62ccf0d1abd9
pid: 40348, system: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64
assigned_tag                            : None
cell_tag                                : None
cell_tag_delim                          : None
cell_tag_split                          : -
chimeric_pairs                          : use
chrom                                   : None
compresslevel                           : 6
detection_method                        : None
filter_umi                              : None
gene_tag                                : None
gene_transcript_map                     : None
get_umi_method                          : read_id
ignore_tlen                             : False
ignore_umi                              : False
in_sam                                  : False
log2stderr                              : False
loglevel                                : 1
mapping_quality                         : 0
method                                  : directional
no_sort_output                          : False
out_sam                                 : False
output_bam                              : True
output_unmapped                         : False
paired                                  : True
per_cell                                : False
per_contig                              : False
per_gene                                : False
random_seed                             : None
read_length                             : False
short_help                              : None
skip_regex                              : ^(__|Unassigned)
soft_clip_threshold                     : 4
spliced                                 : False
stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin                                   : <_io.TextIOWrapper name='UMI3-RSV90-10_S2.aligned.sorted.bam' mode='r' encoding='UTF-8'>
stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout                                  : <_io.TextIOWrapper name='UMI3-RSV90-10_S2_umi.bam' mode='w' encoding='UTF-8'>
subset                                  : None
threshold                               : 1
timeit_file                             : None
timeit_header                           : None
timeit_name                             : all
tmpdir                                  : None
tsv                                     : None
umi_group_tag                           : BX
umi_sep                                 : :
umi_tag                                 : RX
umi_tag_delim                           : None
umi_tag_split                           : None
umi_whitelist                           : None
umi_whitelist_paired                    : None
unmapped_reads                          : discard
unpaired_reads                          : use
whole_contig                            : False

Thanks again.

epiliper · 2024-03-21T21:55:55Z

Update: I think I know what might be happening...

Looking at part of the definition of get_bundles() in lines 320-327 of sam_methods.py:

    if read.is_read2:
        if self.return_read2:
            if not read.is_unmapped or (
                    read.is_unmapped and self.return_unmapped):
                yield read, None, "single_read"
        continue
    else:
        self.read_events['Input Reads'] += 1

If I'm understanding this right, umi-tools doesn't consider read2 in grouping and downstream analysis. That would explain why roughly half of my reads are getting tagged.

Does this sound like a reasonable cause?

I apologize if I missed this in the docs; I guess checking the percentage of reads tagged with "UG" after running group isn't commonly done.
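The effect of the read2 branch above can be mimicked with a toy model. This is only a sketch under simplifying assumptions: `ToyRead`, `toy_bundles` and `toy_group` are invented names, every read1 is treated as groupable, and the group id is a placeholder. It is not the UMI-tools code path.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRead:
    name: str
    is_read2: bool = False
    is_unmapped: bool = False
    tags: dict = field(default_factory=dict)


def toy_bundles(reads, return_read2=True, return_unmapped=False):
    """Yield (read, status): read2s pass straight through, read1s go on to grouping."""
    for read in reads:
        if read.is_read2:
            # Mapped read2s (or unmapped ones, if requested) are handed back untouched.
            if return_read2 and (not read.is_unmapped or return_unmapped):
                yield read, "single_read"
            continue
        # Simplification: every read1 is treated as a candidate for grouping.
        yield read, "to_group"


def toy_group(reads):
    """Tag only the reads that reach the grouping step; write everything out."""
    out = []
    for read, status in toy_bundles(reads):
        if status != "single_read":
            read.tags["UG"] = 0  # placeholder group id, not a real grouping result
        out.append(read)
    return out


pairs = [
    ToyRead("p1"), ToyRead("p1", is_read2=True),
    ToyRead("p2"), ToyRead("p2", is_read2=True),
]
grouped = toy_group(pairs)
tagged = sum("UG" in r.tags for r in grouped)
print(f"{tagged} of {len(grouped)} reads tagged")  # 2 of 4 reads tagged
```

With fully paired input, exactly half of the output reads end up tagged, which matches the roughly 1.6 million of 3.4 million seen here.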

TomSmithCGAT · 2024-03-26T12:51:54Z

Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉

Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options:

UMI-tools/umi_tools/group.py

Lines 242 to 253 in 7e799bc

    
           for bundle, key, status in bundle_iterator(inreads): 
        
               # write out read2s and unmapped/chimeric (if these options are set) 
        
               if status == 'single_read': 
        
                   # bundle is just a single read here 
        
                   nInput += 1 
        
                   if outfile: 
        
                       outfile.write(bundle) 
        
                   nOutput += 1 
        
                   continue

Looking over the documentation, I don't think this is stated anywhere. @IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.

IanSudbery · 2024-03-26T19:01:53Z

I don't see it in the documentation either. I think it should definitely be documented. I also wonder if a tool similar to prepare-for-rsem could add grouping info to the read2s.

TomSmithCGAT · 2024-03-26T21:21:07Z

OK, I'll put updating the docs regarding this on the to-do.

If we do want to include read tags on read2, I'm torn between allowing this in group and writing a separate tool to add them afterwards. The former seems cleaner and shouldn't be too hard, but the latter will probably be quicker to actually get done! Thoughts?
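To make the "separate tool" option concrete, here is a rough sketch of a post-hoc pass that copies group tags from each read1 to the read2 with the same name. Everything in it is an assumption for illustration: reads are plain dicts rather than BAM records, `copy_group_tags` is a hypothetical name, and no such feature exists in UMI-tools at the time of this thread.

```python
def copy_group_tags(records, tags=("UG", "BX")):
    """Copy the given tags from each read1 to the read2 with the same name.

    Each record is a dict with "name", "is_read2" and "tags" keys (toy format).
    Input records are not modified; updated copies are returned.
    """
    # First pass: remember the tags on every tagged read1, keyed by read name.
    by_name = {}
    for rec in records:
        if not rec["is_read2"]:
            found = {t: rec["tags"][t] for t in tags if t in rec["tags"]}
            if found:
                by_name[rec["name"]] = found

    # Second pass: fill in read2s; reads without a tagged mate stay unchanged.
    out = []
    for rec in records:
        if rec["is_read2"] and rec["name"] in by_name:
            rec = {**rec, "tags": {**rec["tags"], **by_name[rec["name"]]}}
        out.append(rec)
    return out


records = [
    {"name": "p1", "is_read2": False, "tags": {"UG": 0, "BX": "ACGT"}},
    {"name": "p1", "is_read2": True, "tags": {}},
    {"name": "p2", "is_read2": True, "tags": {}},  # mate was never grouped
]
result = copy_group_tags(records)
print(result[1]["tags"])  # {'UG': 0, 'BX': 'ACGT'}
```

A real version would need to decide how to treat secondary and supplementary alignments that share a read name, and the two passes mean holding one entry per tagged read1 in memory.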

epiliper · 2024-03-26T23:56:22Z

Thank you so much for confirming the cause of half-tagged reads! Peace of mind has been restored.

If you're asking me for my thoughts:

I asked some labmates also involved in UMI NGS, and we think a solution built within UMI-tools group for R2 processing would be great.
A coworker proposed separating, grouping and deduplicating R1 and R2 in parallel, then aligning them both to the reference in a final bam file.

These are of course just proposals. Either way, we'd be interested to hear more about this, since it would have a massive impact on how we set up our NGS runs.

Thanks again!

TomSmithCGAT self-assigned this Mar 26, 2024

TomSmithCGAT added the todo label Mar 26, 2024

TomSmithCGAT added Documentation and removed todo labels Oct 3, 2024
