
Alternative PR to add duplicate marking to GroupReadsByUmi #940

Merged: 24 commits into main from tf_group_reads_marks_dupes, Oct 26, 2023

Conversation

@tfenne (Member) commented Sep 28, 2023

@JoeVieira and @nh13, this is my attempt to provide a cleaned-up version of #934. I figured this was easier than trying to shepherd the original PR through with lots of comments, though I greatly appreciate you making that PR, Joe, as it showed me how small the changes actually are and motivated me to do this.

Some of the changes are stylistic / Scala idiom cleanups. Functional changes include:

  • Auto-defaulting of minMapQ and inclusion of secondary/supplementary reads based on whether --mark-duplicates is provided
  • I chose not to expose the duplicate ScoringStrategy, as I'm not sure I want to tie ourselves to it, and just picked one to use for now
  • It now also resets the duplicate flag to false on the chosen representative template's reads

Edited to add: oh, and also changing the downstream code to avoid the consensus callers seeing secondary/supplementary reads, should a BAM generated with --mark-duplicates get sent downstream.
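For context, here is a minimal sketch (not the merged code; the function name is illustrative and the accessor names assume fgbio's SamRecord API) of the kind of guard the downstream consensus-calling path could apply so secondary/supplementary records in a --mark-duplicates BAM never reach the callers:

```scala
import com.fulcrumgenomics.bam.api.SamRecord

// Illustrative only: keep a record for consensus calling if it is a primary,
// mapped alignment; secondary and supplementary records are skipped.
def usableForConsensus(rec: SamRecord): Boolean =
  !rec.secondary && !rec.supplementary && rec.mapped
```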

@tfenne requested a review from nh13 September 28, 2023 23:18
@tfenne (Member, Author) commented Sep 28, 2023

Tests could still use some reformatting to follow the conventions of the rest of the code.

@codecov-commenter commented Sep 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (bcb5577) 95.60% compared to head (61b46b1) 95.65%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #940      +/-   ##
==========================================
+ Coverage   95.60%   95.65%   +0.05%     
==========================================
  Files         126      126              
  Lines        7305     7460     +155     
  Branches      517      526       +9     
==========================================
+ Hits         6984     7136     +152     
- Misses        321      324       +3     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 95.65% <100.00%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| --- | --- |
| ...fulcrumgenomics/umi/ConsensusCallingIterator.scala | 100.00% <100.00%> (ø) |
| ...cala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala | 95.58% <100.00%> (+0.22%) ⬆️ |

... and 2 files with indirect coverage changes


@nh13 (Member) left a comment

A few things I'm confused about, but looking really good!

|chromosomes and `--allow-inter-contig` has been set to false.
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
Member:

Suggested change
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
|if `--include-secondary` and\or `--include-supplementary` are set to false (default=false)

Member:

I don't grok 446-447 after multiple attempts at reading the usage (before looking at the code). Perhaps you don't mean "non-primary"? And if not, then I'm really confused!

|read has mapping quality < `min-map-q` (default=1), or c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false.
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
Member:

Suggested change
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
|chromosomes and `--allow-inter-contig` has been set to false, or d.) all non-primary reads are filtered,

Comment on lines 493 to 494
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX",
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI",
Member:

Shall we finally update these to use […] and […]?


/** Sets the duplicate flags on all reads within all templates. */
private def setDuplicateFlags(group: Seq[Template]): Unit = {
val nonDuplicateTemplate = group.minBy { template =>
Member:

Do we want maxBy here? To my eye, htsjdk's duplicate score returns higher values as better:

https://github.com/samtools/htsjdk/blob/38dbd0e6b87a2bd5791bb81399f2b5ec1008773e/src/main/java/htsjdk/samtools/DuplicateScoringStrategy.java#L73-L116

For example, if the duplicate score is the sum of base qualities, then don't we want the template with the highest sum of base qualities?

Member Author:

Good catch!

}

group.foreach { template =>
val flag = !(template eq nonDuplicateTemplate)
Member:

Suggested change
val flag = !(template eq nonDuplicateTemplate)
val flag = template ne nonDuplicateTemplate
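Putting the two review points together, here is a sketch (not the merged code verbatim) of how the helper could look with maxBy and `ne` applied. The per-record flag setting assumes fgbio's `Template.allReads` and a `duplicate` setter on `SamRecord`:

```scala
import com.fulcrumgenomics.bam.Template
import htsjdk.samtools.DuplicateScoringStrategy
import htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy

/** Sets the duplicate flags on all reads within all templates (sketch only). */
private def setDuplicateFlags(group: Seq[Template]): Unit = {
  // Keep the template with the *highest* duplicate score; htsjdk treats higher scores as better.
  val nonDuplicateTemplate = group.maxBy { template =>
    template.primaryReads.sumBy { r =>
      DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
    }
  }

  group.foreach { template =>
    val flag = template ne nonDuplicateTemplate // reference inequality, per the suggestion above
    template.allReads.foreach(_.duplicate = flag) // assumes Template.allReads and SamRecord.duplicate_=
  }
}
```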

private def setDuplicateFlags(group: Seq[Template]): Unit = {
val nonDuplicateTemplate = group.minBy { template =>
template.primaryReads.sumBy { r =>
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
Member:

As a compromise, consider setting the ScoringStrategy as a constructor parameter that's not exposed on the command line?

Member Author:

I really don't want to expose this anywhere - I want the freedom to change this if/when we decide using HTSJDK's duplicate scoring stuff was a bad idea in the future, without breaking any APIs.

@JoeVieira (Contributor), Sep 29, 2023:

A few things in here:
1.) Hard-coding this seems like a really odd choice; we could wrap it if you're concerned about leaking the API.
2.) There isn't a tie-break any longer, so choosing among templates with equal base-quality sums will be random, yeah? That's why mapQ was included originally. I think the tests are passing now just because of the passed-in sort order.

This is tidier logic, nice.

Contributor:

Just as a point regarding duplicate scoring strategies: the core strategies for choosing the representative read, much like UMI clustering algorithms, are fairly established in the literature. While adding others seems possible, not supporting the established ones seems like a self-imposed limitation. I'm happy to wrap these enums, if that's a need from the project's perspective, so that the HTSJDK API doesn't leak.

Member Author:

Lol, this did make me laugh a little. I would agree if deep thought had gone into the original duplicate scoring strategies ... but as the author of that original code (in Picard that moved into HTSJDK) I can tell you that it was me making it up on the fly and coming up with something vaguely sensible!

Over time I tend to think that the duplicate marking stuff in Picard has gotten far too complicated.

I'd also argue that there's little to no value tie-breaking on mapq. If the reads are considered duplicates of each other and have either a) the same sum of base qualities > Q15, or b) the same mapped reference length, there's a very high probability they also have the same sum of mapqs.

Contributor:

One option seems fine. I think it's perfectly fine that it is the sum of base qualities; nothing about Picard's implementation of these scoring strategies is complex. The usages deep in the code get rather complex, but this simple score calculator just seems like good code reuse to me.

I'm happy to look at how mapQ changes things. Of note, my testing is mostly with bowtie2, so that certainly is impacting my perspective here.

I'll stop pushing here and just test a few more things, and give some perspective with data.

Member Author:

Thanks @JoeVieira ... please let me know when you've had a chance to look at your data.

Contributor:

@tfenne it'll be a little while before I'm able to dig back into that data, but I will in the next week or so.

Contributor:

I think it would be sensible to choose the read with the highest mapq and break ties randomly.

@JoeVieira did you have any suggestions after reviewing some data?

Contributor:

Sorry all, I have not been able to review data at scale yet. It's on my list for next week, I promise.

@tfenne requested a review from nh13 September 29, 2023 12:02
@@ -499,6 +519,10 @@ class GroupReadsByUmi

private val assigner = strategy.newStrategy(this.edits)

// Give values to unset parameters that are different in duplicate marking mode
Contributor:

Coupling these seems odd to me & deviates from the intended change set.

Member:

Can you provide a bit more context on why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e., don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and not pre-filter them) too. What am I missing?

Also, what's the "intended change set"?

@JoeVieira (Contributor), Sep 29, 2023:

Given that the tool allows for multiple kinds of filtering, allowing flexibility to choose what one is and isn't filtering seems like appropriate control over the functionality, rather than forcing a workflow.

It's true that if one is keeping all reads, one must update all reads, but marking doesn't seem to require forcing all reads to stay. Forcing to keep them all is the odd part from my perspective.

Of course one could filter reads after doing this, if so inclined, but maintaining flexibility rather than forcing workflows seems like a functional win for the toolset.

@nh13 (Member), Sep 29, 2023:

I don't think any option is "forced"; it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set --include-supplementary=true or --include-supplementary=false. I think that's why the types of the command line parameters are Option[Boolean] with default None (see the sketch after this list):

  1. None -> use a default
  2. Some(true) -> include supplementary
  3. Some(false) -> filter out supplementary
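As a sketch of that pattern (variable names are illustrative, not the tool's actual fields): an unset option resolves to a mode-dependent default, while an explicit Some(...) always wins.

```scala
// Illustrative only; mirrors the None / Some(true) / Some(false) semantics above.
val markDuplicates: Boolean               = true // --mark-duplicates
val minMapQ: Option[Int]                  = None // --min-map-q left unset
val includeSupplementary: Option[Boolean] = None // --include-supplementary left unset

// Unset options fall back to duplicate-marking-friendly defaults; explicit values override.
val effectiveMinMapQ              = minMapQ.getOrElse(if (markDuplicates) 0 else 1)
val effectiveIncludeSupplementary = includeSupplementary.getOrElse(markDuplicates)
// Passing Some(false) would still filter out supplementary reads even with --mark-duplicates.
```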

@JoeVieira (Contributor), Sep 29, 2023:

Fair. I'm being overly dramatic with "forced", and you're right, it can be overridden. You're the maintainers, so the paradigm the tool uses is ultimately up to you.

What strikes me as odd is changing defaults when semi-related parameters are passed. They aren't strictly coupled parameters (if they were strictly coupled I'd agree). I think this makes for a usage that implies a specific workflow and forces users to read the docs more closely if their workflow isn't that one.

As a user, when defaults change because of other parameters, it can lead to unexpected results, which are often hard to debug.

Member:

Fair. The alternative is to provide a second command-line tool that performs the duplicate marking as a facade over GroupReadsByUmi, but that seems a bit heavy-handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but I'd lean toward the current implementation as it is well documented in the usage.

Member Author:

I made this change because otherwise, if one wants to use GroupReadsByUmi as an equivalent to MarkDuplicates in Picard, one would have to add:

--mark-duplicates --min-mapq 0 --include-secondary --include-supplementary

which to me seems error prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly.

@JoeVieira (Contributor):

@tfenne appreciate you tackling it.

I certainly would have preferred feedback rather than this, but I'm happy it's getting attention. This introduces some oddities that were not present in my change set, while a couple of suggested edits on my PR would have gotten your point about using groupBy instead of sorting across just fine.

private def setDuplicateFlags(group: Seq[Template]): Unit = {
val nonDuplicateTemplate = group.maxBy { template =>
template.primaryReads.sumBy { r =>
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
Contributor:

Suggested change
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
r.mapq

@JoeVieira (Contributor):

Another thing to consider here, which I was planning on layering in after this PR, is allowing unmapped reads through as well (which will also necessitate an adjustment to CallConsensusReads).

Removing the filtering and ensuring these reads get skipped for UMI grouping and just passed through is how I was considering doing this.

@tfenne self-assigned this Oct 4, 2023
Added filtering of unmapped reads and reads with unmapped mates in the consensus calling iterator.
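A sketch of that commit's intent (not the actual diff; accessor names assume fgbio's SamRecord API): extend the pre-consensus guard so unmapped reads and reads with unmapped mates are also dropped before consensus calling.

```scala
import com.fulcrumgenomics.bam.api.SamRecord

// Illustrative only: a record is kept for consensus calling if it is primary,
// mapped, and (when paired) its mate is mapped as well.
def keepForConsensus(rec: SamRecord): Boolean =
  !rec.secondary && !rec.supplementary && rec.mapped && (!rec.paired || rec.mateMapped)
```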
@nh13 (Member) left a comment

LGTM. Until we have a second duplicate scoring strategy that is not random, I agree that setting to sum of base qualities from htsjdk is fine. It is also a proxy for aligned length, as well as read length, which can vary even when the outer coordinates are the same.

Aside: I would agree that "by mapping quality" is effectively random, since read pairs that are duplicates of each other are almost surely aligned the same by virtue of being duplicates of each other and therefore having the same mapq by current aligners.

@nh13 (Member) left a comment

LGTM. I did add a commit (I hope that's OK) to make it work with codecov (which now requires a token, even for public repos).

@tfenne merged commit 6fa1fde into main Oct 26, 2023
6 checks passed
@tfenne deleted the tf_group_reads_marks_dupes branch October 26, 2023 11:47