Alternative PR to add duplicate marking to GroupReadsByUmi #940
Conversation
…etection strategies. Set the duplicate flag as needed on all reads inside a Template that are not in the top-scoring read set.

Tests could still use some reformatting to follow the conventions of other code.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #940      +/-   ##
==========================================
+ Coverage   95.60%   95.65%   +0.05%
==========================================
  Files         126      126
  Lines        7305     7460     +155
  Branches      517      526       +9
==========================================
+ Hits         6984     7136     +152
- Misses        321      324       +3

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
A few things I'm confused about, but looking really good!
|chromosomes and `--allow-inter-contig` has been set to false.
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
Suggested change:
- |if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
+ |if `--include-secondary` and\or `--include-supplementary` are set to false (default=false)
I don't grok 446-447 after multiple attempts at reading the usage (before looking at the code). Perhaps you don't mean "non-primary"? And if not, then I'm really confused!
|read has mapping quality < `min-map-q` (default=1), or c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false.
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
Suggested change:
- |chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
+ |chromosomes and `--allow-inter-contig` has been set to false, or d.) all non-primary reads are filtered,
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX",
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI",
Shall we finally update these to use:
val UmiBases = "RX"
val MolecularId = "MI"
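A minimal sketch of what that could look like, assuming the constants live in a shared object as quoted above (the object name here is an assumption):

```scala
// Assumed home for the shared tag constants (object name is hypothetical).
object ConsensusTags {
  val UmiBases: String    = "RX" // tag holding the raw UMI bases
  val MolecularId: String = "MI" // tag holding the assigned molecular ID
}

// The @arg defaults could then reference the constants instead of string literals:
//   @arg(flag='t', doc="The tag containing the raw UMI.")  val rawTag: String    = ConsensusTags.UmiBases,
//   @arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = ConsensusTags.MolecularId,
```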
/** Sets the duplicate flags on all reads within all templates. */
private def setDuplicateFlags(group: Seq[Template]): Unit = {
  val nonDuplicateTemplate = group.minBy { template =>
Do we want `maxBy` here? To my eye, htsjdk's duplicate score returns higher values as better. For example, if the duplicate score is the sum of base qualities, then don't we want the template with the highest sum of base qualities?
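A tiny self-contained check of that claim, using synthetic unpaired records (the setup here is an assumption; `computeDuplicateScore` is the htsjdk call used in the PR):

```scala
import htsjdk.samtools.{DuplicateScoringStrategy, SAMFileHeader, SAMRecord}
import htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy

// Two synthetic records that differ only in base qualities.
val header = new SAMFileHeader()
def rec(qual: Byte): SAMRecord = {
  val r = new SAMRecord(header)
  r.setReadBases("ACGT".getBytes)
  r.setBaseQualities(Array.fill[Byte](4)(qual))
  r
}

val lowScore  = DuplicateScoringStrategy.computeDuplicateScore(rec(20), ScoringStrategy.SUM_OF_BASE_QUALITIES)
val highScore = DuplicateScoringStrategy.computeDuplicateScore(rec(40), ScoringStrategy.SUM_OF_BASE_QUALITIES)
assert(highScore > lowScore) // higher qualities => higher score, so `maxBy` picks the best template
```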
Good catch!
}

group.foreach { template =>
  val flag = !(template eq nonDuplicateTemplate)
Suggested change:
- val flag = !(template eq nonDuplicateTemplate)
+ val flag = template ne nonDuplicateTemplate
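For reference, `eq`/`ne` compare object identity on `AnyRef`, so `ne` is the direct negation of the original expression:

```scala
// `eq`/`ne` are reference (identity) comparisons; `==` is value equality.
val a = new String("dup")
val b = a                 // the same instance
val c = new String("dup") // equal content, different instance

assert(a eq b) // identical reference
assert(a ne c) // different reference...
assert(a == c) // ...despite equal value
```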
private def setDuplicateFlags(group: Seq[Template]): Unit = {
  val nonDuplicateTemplate = group.minBy { template =>
    template.primaryReads.sumBy { r =>
      DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
As a compromise, consider setting the `ScoringStrategy` as a constructor parameter that's not exposed on the command line?
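A rough sketch of that compromise (the wrapper class here is hypothetical, not code from this PR):

```scala
import htsjdk.samtools.{DuplicateScoringStrategy, SAMRecord}
import htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy

// Hypothetical: the strategy is a plain constructor parameter with a default,
// so it is overridable in code (and tests) without appearing as a command-line @arg.
class DuplicateScorer(strategy: ScoringStrategy = ScoringStrategy.SUM_OF_BASE_QUALITIES) {
  def score(rec: SAMRecord): Short =
    DuplicateScoringStrategy.computeDuplicateScore(rec, strategy)
}
```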
I really don't want to expose this anywhere - I want the freedom to change this if/when we decide using HTSJDK's duplicate scoring stuff was a bad idea in the future, without breaking any APIs.
A few things in here:
1.) Hard-coding this seems like a really odd choice; we could wrap it if you're concerned about leaking API.
2.) There isn't a tie-break any longer, so choosing amongst equal base-quality sums will be random, yeah? Which is why mapQ was included originally. I think the tests are passing now just because of the passed-in sort order.
This is more tidy logic, nice.
Just as a point re duplicate scoring strategies: the core strategies for choosing the representative read, much like UMI clustering algorithms, are fairly well established in the literature. While having others seems possible, not supporting the established ones seems like a self-imposed limitation. I'm happy to wrap these enums, if that's a need from the project's perspective, so that the HTSJDK API isn't leaking.
Lol, this did make me laugh a little. I would agree if deep thought had gone into the original duplicate scoring strategies ... but as the author of that original code (in Picard that moved into HTSJDK) I can tell you that it was me making it up on the fly and coming up with something vaguely sensible!
Over time I tend to think that the duplicate marking stuff in Picard has gotten far too complicated.
I'd also argue that there's little to no value in tie-breaking on mapq. If the reads are considered duplicates of each other and have either a) the same sum of base qualities (over bases > Q15), or b) the same mapped reference length, there's a very high probability they also have the same sum of mapqs.
One option seems fine. I think it's perfectly fine that the one option is sum of base qualities; nothing about Picard's implementation of these scoring strategies is complex. The usages deep in the code get to be rather complex, but this simple score calculator just seems like good code reuse to me.
I'm happy to look at how mapq changes things. Of note, my testing is mostly with bowtie2, so that certainly is impacting my perspective here.
I'll stop pushing here & just test a few more things, and give some perspective with data.
Thanks @JoeVieira ... please let me know when you've had a chance to look at your data.
@tfenne it'll be a little bit before I'm able to dig back into that data, but I will in the next week or so.
I think it would be sensible to choose the read with the highest mapq, breaking ties randomly.
@JoeVieira did you have any suggestions after reviewing some data?
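A minimal sketch of that selection rule (hypothetical helper, not part of this PR):

```scala
import scala.util.Random

// Pick the highest-scoring element, breaking ties randomly: shuffling first
// randomizes which of several equal-score elements is returned, since maxBy
// keeps the first maximum in iteration order.
def pickRepresentative[A](group: Seq[A], score: A => Int, rng: Random = new Random()): A =
  rng.shuffle(group).maxBy(score)

// e.g. pickRepresentative(Seq(3, 7, 7, 2), identity[Int]) returns one of the two 7s at random
```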
Sorry all, I have not been able to review data at scale yet. It's on my list for next week, I promise.
@@ -499,6 +519,10 @@ class GroupReadsByUmi

private val assigner = strategy.newStrategy(this.edits)

// Give values to unset parameters that are different in duplicate marking mode
Coupling these seems odd to me & deviates from the intended change set.
Can you provide a bit more context of why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e. don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and don't prefilter them) too. What am I missing?
Also, what's the "intended change set"?
Given that the tool allows for multiple kinds of filtering, giving the user the flexibility to choose what is and isn't filtered seems like appropriate control over the functionality, rather than forcing a workflow.
It's true that if one is keeping all reads, one must update all reads; but keeping all reads doesn't seem like a requirement for marking. The forcing to keep all is the odd part from my perspective.
Of course one could filter reads after doing this, if so inclined, but maintaining flexibility rather than forcing workflows seems like a functional win for the toolset.
I don't think any option is "forced", it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set `--include-supplementary=true` or `--include-supplementary=false`. I think that's why the types on the command line parameters are `Option[Boolean]` with default `None`:

- `None` -> use a default
- `Some(true)` -> include supplementary
- `Some(false)` -> filter out supplementary
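A small sketch of how such a tri-state flag can be resolved (the helper and parameter names here are hypothetical):

```scala
// None    = unset: fall back to a mode-dependent default.
// Some(x) = explicitly set: always wins.
def resolveFlag(flag: Option[Boolean], markDuplicates: Boolean): Boolean =
  flag.getOrElse(markDuplicates) // e.g. include supplementary by default only when marking duplicates

resolveFlag(None,        markDuplicates = true)  // true:  default in duplicate-marking mode
resolveFlag(None,        markDuplicates = false) // false: default in grouping/consensus mode
resolveFlag(Some(false), markDuplicates = true)  // false: explicit setting overrides the default
```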
Fair. I'm being overly dramatic with "forced" & you're right, it can be overridden; you're the maintainers, so the paradigm the tool uses is ultimately up to you.
What strikes me as odd is changing defaults when semi-related parameters are passed - they aren't strictly coupled parameters (if they were strictly coupled I'd agree). I think this makes for a usage that implies a specific workflow & causes users to have to read the docs more closely if their workflow isn't that one.
As a user, when defaults change because of other parameters, it can produce unexpected results, which are often hard to debug.
Fair, the alternative is to provide a second command line tool that performs the duplicate marking and is just a facade over GroupReadsByUmi, but that seems a bit heavy-handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but I would lean toward the current implementation as it is well-documented in the usage.
I made this change because otherwise, if one wants to use `GroupReadsByUmi` as an equivalent to `MarkDuplicates` in Picard, one would have had to add:

--mark-duplicates --min-mapq 0 --include-secondary --include-supplementary

which to me seems error-prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly.
@tfenne appreciate you tackling it. I certainly would have preferred feedback rather than this, but I'm happy it's getting attention. This introduces some oddities that were not present in my change set, while a couple of suggested edits on my PR would have gotten your point of using groupBy instead of sorting across just fine.
Co-authored-by: Nils Homer <[email protected]>
private def setDuplicateFlags(group: Seq[Template]): Unit = {
  val nonDuplicateTemplate = group.maxBy { template =>
    template.primaryReads.sumBy { r =>
      DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
Suggested change:
- DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
+ r.mapq
Another thing to consider here, which I was planning on layering in after this PR, is allowing unmapped reads through as well (which will necessitate an adjustment to `CallConsensusReads` also). Removing the filtering & ensuring these reads get skipped for UMI grouping & just passed through is how I was considering doing this.
Added filtering of unmapped reads and reads with unmapped mates in the consensus calling iterator.
LGTM. Until we have a second duplicate scoring strategy that is not random, I agree that setting to sum of base qualities from htsjdk is fine. It is also a proxy for aligned length, as well as read length, which can vary even when the outer coordinates are the same.
Aside: I would agree that "by mapping quality" is effectively random, since read pairs that are duplicates of each other are almost surely aligned the same by virtue of being duplicates of each other and therefore having the same mapq by current aligners.
LGTM. I did add a commit (I hope that's ok) to make it work with codecov (which now requires a token, even for public repos).
@JoeVieira and @nh13 this is my attempt to provide a cleaned up version of #934. I figured this was easier than trying to shepherd the original PR through with lots of comments - though I greatly appreciate you making that PR Joe as it showed me how small the changes actually are and motivated me to do this.
Some of the changes are stylistic / Scala idiom stuff. Functional changes include:

- … `--mark-duplicates` is provided

Edited to add: Oh, and also changing the code downstream to avoid the consensus callers seeing secondary/supplementary reads, should a BAM generated with `--mark-duplicates` get sent downstream.