Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative PR to add duplicate marking to GroupReadsByUmi #940

Merged
merged 24 commits into from
Oct 26, 2023
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
180b29f
consolodate imports
JoeVieira Aug 29, 2023
596b9eb
consolodate imports
JoeVieira Aug 29, 2023
f7b72ab
Logic to calculate scores duplicate scores using htslib duplication d…
JoeVieira Aug 29, 2023
6d62bfc
basic test for duplicate marking
JoeVieira Aug 29, 2023
5bfc4a0
use mark dup flag to gate functionality, reorganize comments
JoeVieira Aug 29, 2023
61bf600
single ended tests
JoeVieira Aug 29, 2023
0cf6ddd
simplify comparison and score calculation into one method, using sort…
JoeVieira Aug 30, 2023
b9c075f
comments
JoeVieira Aug 30, 2023
0a702e7
adjustments to tests for more explicit tie breaking.
JoeVieira Aug 30, 2023
ccd6882
ensure dups are not marked if flag not passed
JoeVieira Aug 30, 2023
8e49514
remove extra linebreak
JoeVieira Aug 31, 2023
90da16f
optionally include secondary & supplementary reads.
JoeVieira Aug 31, 2023
25e8c5e
doc strings / usage docs
JoeVieira Sep 1, 2023
cc9f305
remove extra line break
JoeVieira Sep 1, 2023
02a93ad
formatting
JoeVieira Sep 1, 2023
39be56d
formatting
JoeVieira Sep 1, 2023
c79d7bd
clarify duplicate comments
JoeVieira Sep 12, 2023
5a75724
Tidied up implementation of duplicate marking in GroupReadsByUmi.
tfenne Sep 28, 2023
f143c95
Cleanup
tfenne Sep 28, 2023
cc1cb99
Review fixup
tfenne Sep 29, 2023
126009b
Update src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala
tfenne Sep 29, 2023
a637753
Added mapq as part of the scoring of duplicate reads.
tfenne Oct 25, 2023
7161c4a
Update usage docs a little bit.
tfenne Oct 25, 2023
61b46b1
Add codecov token
nh13 Oct 25, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,12 @@ class ConsensusCallingIterator[ConsensusRead <: SimpleRead](sourceIterator: Iter
private var collectedStats: Boolean = false

protected val iter: Iterator[SamRecord] = {
val filteredIterator = sourceIterator.filterNot(r => r.secondary || r.supplementary)

// Wrap our input iterator in a progress logging iterator if we have a progress logger
val progressIterator = progress match {
case Some(p) => sourceIterator.tapEach { r => p.record(r) }
case None => sourceIterator
case Some(p) => filteredIterator.tapEach { r => p.record(r) }
case None => filteredIterator
}

// Then turn it into a grouping iterator
Expand Down
95 changes: 67 additions & 28 deletions src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala
Original file line number Diff line number Diff line change
Expand Up @@ -26,11 +26,10 @@
*/
package com.fulcrumgenomics.umi

import java.util.concurrent.atomic.AtomicLong
import com.fulcrumgenomics.FgBioDef._
import com.fulcrumgenomics.bam.api.SamOrder.TemplateCoordinate
import com.fulcrumgenomics.bam.{Bams, Template}
import com.fulcrumgenomics.bam.api.{SamOrder, SamRecord, SamSource, SamWriter}
import com.fulcrumgenomics.bam.{Bams, Template}
import com.fulcrumgenomics.cmdline.{ClpGroups, FgBioTool}
import com.fulcrumgenomics.commons.util.{LazyLogging, NumericCounter, SimpleCounter}
import com.fulcrumgenomics.sopt.{arg, clp}
Expand All @@ -39,13 +38,14 @@ import com.fulcrumgenomics.util.Metric.{Count, Proportion}
import com.fulcrumgenomics.util.Sequences.countMismatches
import com.fulcrumgenomics.util._
import enumeratum.EnumEntry
import htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy
import htsjdk.samtools._
import htsjdk.samtools.util.SequenceUtil

import scala.collection.BufferedIterator
import java.util.concurrent.atomic.AtomicLong
import scala.collection.immutable.IndexedSeq
import scala.collection.mutable
import scala.collection.mutable.ListBuffer
import scala.collection.{BufferedIterator, Iterator, mutable}


object GroupReadsByUmi {
Expand Down Expand Up @@ -439,9 +439,13 @@ object Strategy extends FgBioEnum[Strategy] {
| 3. The assigned UMI tag
| 4. Read Name
|
|If the input is not template-coordinate sorted (i.e. `SO:unsorted GO:query SS:unsorted:template-coordinate`), then
|this tool will re-sort the input. The output will be written in template-coordinate order.
|
|During grouping, reads are filtered out if a) all reads with the same queryname are unmapped, b) any primary
|read has mapping quality < `min-map-q` (default=1), or c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false.
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered,
|chromosomes and `--allow-inter-contig` has been set to false, or d.) all non-primary reads are filtered,

|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false)
|if `--include-secondary` and\or `--include-supplementary` are set to false (default=false)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't grok 446-447 after multiple attempts at reading the usage (before looking at the code). Perhaps you don't mean "non-primary"? And if not, then I'm really confused!

|
|Grouping of UMIs is performed by one of four strategies:
|
Expand Down Expand Up @@ -470,27 +474,38 @@ object Strategy extends FgBioEnum[Strategy] {
|(i.e. does not count dashes and other non-ACGT characters). This option is not implemented for reads with UMI pairs
|(i.e. using the paired assigner).
|
|If the input is not template-coordinate sorted (i.e. `SO:unsorted GO:query SS:unsorted:template-coordinate`), then
|this tool will re-sort the input. The ouitput will be written in template-coordinate order.
|If the `--mark-duplicates` option is given, reads will also have their duplicate flag set in the BAM file.
|Each tag-family is treated separately, and a single template within the tag family is chosen to be the "unique"
|template and marked as non-duplicate, while all other templates in the tag family are then marked as duplicate.
|
|Several parameters have different defaults depending on whether duplicates are being marked or not (all are
|directly settable on the command line):
|
| 1. `--min-map-q` defaults to 0 in duplicate marking mode and 1 otherwise
| 2. `--include-secondary` defaults to true in duplicate marking mode and false otherwise
| 3. `--include-suppementary` defaults to true in duplicate marking mode and false otherwise
tfenne marked this conversation as resolved.
Show resolved Hide resolved
"""
)
class GroupReadsByUmi
( @arg(flag='i', doc="The input BAM file.") val input: PathToBam = Io.StdIn,
@arg(flag='o', doc="The output BAM file.") val output: PathToBam = Io.StdOut,
@arg(flag='f', doc="Optional output of tag family size counts.") val familySizeHistogram: Option[FilePath] = None,
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX",
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI",
@arg(flag='m', doc="Minimum mapping quality for mapped reads.") val minMapQ: Int = 1,
@arg(flag='n', doc="Include non-PF reads.") val includeNonPfReads: Boolean = false,
@arg(flag='s', doc="The UMI assignment strategy.") val strategy: Strategy,
@arg(flag='e', doc="The allowable number of edits between UMIs.") val edits: Int = 1,
@arg(flag='l', doc= """The minimum UMI length. If not specified then all UMIs must have the same length,
|otherwise discard reads with UMIs shorter than this length and allow for differing UMI lengths.
|""")
val minUmiLength: Option[Int] = None,
@arg(flag='x', doc= """
|DEPRECATED: this option will be removed in future versions and inter-contig reads will be
|automatically processed.""")
(@arg(flag='i', doc="The input BAM file.") val input: PathToBam = Io.StdIn,
@arg(flag='o', doc="The output BAM file.") val output: PathToBam = Io.StdOut,
@arg(flag='f', doc="Optional output of tag family size counts.") val familySizeHistogram: Option[FilePath] = None,
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX",
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we finally update these to use

and ?

@arg(flag='d', doc="Turn on duplicate marking mode.") val markDuplicates: Boolean = false,
@arg(flag='S', doc="Include secondary reads.") val includeSecondary: Option[Boolean] = None,
@arg(flag='U', doc="Include supplementary reads.") val includeSupplementary: Option[Boolean] = None,
@arg(flag='m', doc="Minimum mapping quality for mapped reads.") val minMapQ: Option[Int] = None,
@arg(flag='n', doc="Include non-PF reads.") val includeNonPfReads: Boolean = false,
@arg(flag='s', doc="The UMI assignment strategy.") val strategy: Strategy,
@arg(flag='e', doc="The allowable number of edits between UMIs.") val edits: Int = 1,
@arg(flag='l', doc= """The minimum UMI length. If not specified then all UMIs must have the same length,
|otherwise discard reads with UMIs shorter than this length and allow for differing UMI lengths.
|""")
val minUmiLength: Option[Int] = None,
@arg(flag='x', doc= """
|DEPRECATED: this option will be removed in future versions and inter-contig reads will be
|automatically processed.""")
@deprecated val allowInterContig: Boolean = true
)extends FgBioTool with LazyLogging {
import GroupReadsByUmi._
Expand All @@ -499,6 +514,10 @@ class GroupReadsByUmi

private val assigner = strategy.newStrategy(this.edits)

// Give values to unset parameters that are different in duplicate marking mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coupling these seems odd to me & deviates from the intended change set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a bit more context of why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e. don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and don't prefilter them) too. What am I missing?

Also, what's the "intended change set"?

Copy link
Contributor

@JoeVieira JoeVieira Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the tool allows for multiple versions of filtering, allowing flexibility to choose what one is filtering & not seems like it's giving appropriate control over the functionality, rather than forcing a workflow.

It's true that, if one is keeping all reads, one must update all reads, but forcing all reads to stay, when marking doesn't seem like a requirement for that. The forcing to keep all is the odd part from my perspective.

Of course one could filter reads after doing this, if so inclined, but maintaining flexibility, rather than forcing workflows seems like a functional win for the toolset.

Copy link
Member

@nh13 nh13 Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any option is "forced", it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set --include-supplementary=true or --include-suppementary=false. I think that's why the types on the command line parameters are Option[Boolean] with default None:

  1. None -> use a default
  2. Some(true) -> include supplementary
  3. Some(false) -> filter out supplementary

Copy link
Contributor

@JoeVieira JoeVieira Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. I'm being overly dramatic with "forced" & you're right, it can be overidden, you're the maintainers so the paradigm the tool uses is ultimately up to you.

What just strikes me as odd to change defaults when semi related parameters are passed - they aren't strictly coupled parameters, if they were strictly coupled i'd agree - i think this makes for a usage that implies a specific workflow & causes users to have to more closely read the docs if their workflow isn't that one.

As a user, when defaults change because of other parameters it can result in unexpected results - which are often hard to debug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, the alternative is to provide a second command line tool that performs the duplicate marking that is just a facade over group reads by umi, but that seems a bit heavy handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but would lean to the current implementation as it is well-documented in the usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change because otherwise, if one wants to use GroupReadsByUmi as an equivalent to MarkDuplicates in picard, one would have had to add:

--mark-duplicates --min-mapq 0 --include-secondary --include-supplementary

which to me seems error prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly.

private val _minMapQ = this.minMapQ.getOrElse(if (this.markDuplicates) 0 else 1)
private val _includeSecondaryReads = this.includeSecondary.getOrElse(this.markDuplicates)
private val _includeSupplementaryReads = this.includeSupplementary.getOrElse(this.markDuplicates)

/** Checks that the read's mapq is over a minimum, and if the read is paired, that the mate mapq is also over the min. */
private def mapqOk(rec: SamRecord, minMapQ: Int): Boolean = {
Expand Down Expand Up @@ -546,11 +565,12 @@ class GroupReadsByUmi
// Filter and sort the input BAM file
logger.info("Filtering the input.")
val filteredIterator = in.iterator
.filter(r => !r.secondary && !r.supplementary)
.filter(r => this._includeSecondaryReads || !r.secondary )
.filter(r => this._includeSupplementaryReads || !r.supplementary )
.filter(r => (includeNonPfReads || r.pf) || { filteredNonPf += 1; false })
.filter(r => (r.mapped || (r.paired && r.mateMapped)) || { filteredPoorAlignment += 1; false })
.filter(r => (allowInterContig || r.unpaired || r.refIndex == r.mateRefIndex) || { filteredPoorAlignment += 1; false })
.filter(r => mapqOk(r, this.minMapQ) || { filteredPoorAlignment += 1; false })
.filter(r => mapqOk(r, this._minMapQ) || { filteredPoorAlignment += 1; false })
.filter(r => !r.get[String](rawTag).exists(_.contains('N')) || { filteredNsInUmi += 1; false })
.filter { r =>
this.minUmiLength.forall { l =>
Expand Down Expand Up @@ -610,10 +630,15 @@ class GroupReadsByUmi
// Take the next set of templates by position and assign UMIs
val templates = takeNextGroup(templateCoordinateIterator, canTakeNextGroupByUmi=canTakeNextGroupByUmi)
assignUmiGroups(templates)

// Then output the records in the right order (assigned tag, read name, r1, r2)
// Then group the records in the right order (assigned tag, read name, r1, r2)
val templatesByMi = templates.groupBy { t => t.r1.get.apply[String](this.assignTag) }

// If marking duplicates, assign bitflag to all duplicate reads
if (this.markDuplicates) {
templatesByMi.values.foreach(t => setDuplicateFlags(t))
}

// Then output the records in the right order (assigned tag, read name, r1, r2)
templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => {
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => {
out += rec
Expand Down Expand Up @@ -731,4 +756,18 @@ class GroupReadsByUmi
r1[String](this.rawTag)
}
}

/** Sets the duplicate flags on all reads within all templates. */
private def setDuplicateFlags(group: Seq[Template]): Unit = {
val nonDuplicateTemplate = group.minBy { template =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want maxBy here? To my eye, htsjdk's duplicate score returns higher values as better:

https://github.com/samtools/htsjdk/blob/38dbd0e6b87a2bd5791bb81399f2b5ec1008773e/src/main/java/htsjdk/samtools/DuplicateScoringStrategy.java#L73-L116

For example, if the duplicate score is the sum of base qualities, then don't we want the template with the highest sum of base qualities?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

template.primaryReads.sumBy { r =>
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a compromise, consider setting the ScoringStrategy as a constructor parameter that's not exposed on the command line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't want to expose this anywhere - I want the freedom to change this if/when we decide using HTSJDK's duplicate scoring stuff was a bad idea in the future, without breaking any APIs.

Copy link
Contributor

@JoeVieira JoeVieira Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things in here -
1.) hard-coding this seems like a really odd choice, we could wrap it if you're concerned about leaking API.
2.) There isn't a tie break any longer so, choosing amongst == baseQs will be random, yeah? which is why mapQ was included originally. I think the tests are passing now just because of the passed in sort order.

this is more tidy logic, nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a point re duplicate scoring strategies - the core strategies for choosing the representative read, much like UMI clustering algorithms are fairly established in literature & while having others seems possible, not supporting established ones, seems like a limitation which would be self imposed, I'm happy to wrap these enums if that's a need from the projects perspective so that the HTSJDK api isn't leaking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, this did make me laugh a little. I would agree if deep thought had gone into the original duplicate scoring strategies ... but as the author of that original code (in Picard that moved into HTSJDK) I can tell you that it was me making it up on the fly and coming up with something vaguely sensible!

Over time I tend to think that the duplicate marking stuff in Picard has gotten far too complicated.

I'd also argue that there's little to no value tie-breaking on mapq. If the reads are considered duplicates of each other and have either a) the same sum of base qualities > Q15, or b) the same mapped reference length, there's a very high probability they also have the same sum of mapqs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One options seems fine. I think it's perfectly fine that that is sum of base q, nothing about picards implementation of these scoring strategies is complex, the usages deep in the code get to be rather complex, but this simple score calculator just seems to be good code reuse to me.

I'm happy to look at how map q changes of note my testing is with bowtie2 mostly - so that certainly is impacting my perspective here.

I'll stop pushing here & just test a few more things, give some perspective with data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JoeVieira ... please let me know when you've had a chance to look at your data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne it'll be a little bit before I'm able to dig back into that data. But will in the next week or so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be sensible to choose the read with highest mapq and breaking ties randomly.

@JoeVieira did you have any suggestions after reviewing some data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry all, I have no been able to review data at scale yet. It's on my list for next week i promise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES)
r.mapq

}
}

group.foreach { template =>
val flag = !(template eq nonDuplicateTemplate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val flag = !(template eq nonDuplicateTemplate)
val flag = template ne nonDuplicateTemplate

template.allReads.foreach(_.duplicate = flag)
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -229,7 +229,7 @@ class GroupReadsByUmiTest extends UnitSpec with OptionValues with PrivateMethodT
val in = builder.toTempFile()
val out = Files.createTempFile("umi_grouped.", ".sam")
val hist = Files.createTempFile("umi_grouped.", ".histogram.txt")
val tool = new GroupReadsByUmi(input=in, output=out, familySizeHistogram=Some(hist), rawTag="RX", assignTag="MI", strategy=Strategy.Edit, edits=1, minMapQ=30)
val tool = new GroupReadsByUmi(input=in, output=out, familySizeHistogram=Some(hist), rawTag="RX", assignTag="MI", strategy=Strategy.Edit, edits=1, minMapQ=Some(30))
val logs = executeFgbioTool(tool)

val groups = readBamRecs(out).groupBy(_.name.charAt(0))
Expand All @@ -255,6 +255,67 @@ class GroupReadsByUmiTest extends UnitSpec with OptionValues with PrivateMethodT
}
}

it should "correctly mark duplicates on duplicate reads in group, when flag is passed" in {
val builder = new SamBuilder(readLength = 100, sort = Some(SamOrder.Coordinate))
// Mapping Quality is a tie breaker, so use that to our advantage here.
builder.addPair(mapq1 = 10, mapq2 = 10, name = "a01", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 30, mapq2 = 30, name = "a02", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 100, mapq2 = 10, name = "a03", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 0, mapq2 = 0, name = "a04", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))

val in = builder.toTempFile()
val out = Files.createTempFile("umi_grouped.", ".sam")
val hist = Files.createTempFile("umi_grouped.", ".histogram.txt")
val gr = new GroupReadsByUmi(input=in, output=out, familySizeHistogram=Some(hist), strategy=Strategy.Paired, edits=1, markDuplicates=true)

gr.markDuplicates shouldBe true

val recs = readBamRecs(out)
recs.filter(_.name.equals("a01")).forall(_.duplicate == true) shouldBe true
recs.filter(_.name.equals("a02")).forall(_.duplicate == true) shouldBe true
recs.filter(_.name.equals("a03")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a04")).forall(_.duplicate == true) shouldBe true
}

it should "does not mark duplicates on reads in group, when flag is not passed" in {
val builder = new SamBuilder(readLength = 100, sort = Some(SamOrder.Coordinate))
// Mapping Quality is a tie breaker, so use that to our advantage here.
builder.addPair(mapq1 = 10, mapq2 = 10, name = "a01", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 30, mapq2 = 30, name = "a02", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 100, mapq2 = 10, name = "a03", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))
builder.addPair(mapq1 = 0, mapq2 = 0, name = "a04", start1 = 100, start2 = 300, strand1 = Plus, strand2 = Minus, attrs = Map("RX" -> "ACT-ACT"))

val in = builder.toTempFile()
val out = Files.createTempFile("umi_grouped.", ".sam")
val hist = Files.createTempFile("umi_grouped.", ".histogram.txt")
new GroupReadsByUmi(input=in, output=out, familySizeHistogram=Some(hist), strategy=Strategy.Paired, edits=1).execute()

val recs = readBamRecs(out)
recs.filter(_.name.equals("a01")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a02")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a03")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a04")).forall(_.duplicate == false) shouldBe true
}

it should "correctly mark duplicates on duplicate single-end reads with UMIs" in {
val builder = new SamBuilder(readLength = 100, sort = Some(SamOrder.Coordinate))
builder.addFrag(mapq = 100, name = "a01", start = 100, attrs = Map("RX" -> "AAAAAAAA"))
builder.addFrag(mapq = 10, name = "a02", start = 100, attrs = Map("RX" -> "AAAAAAAA"))
builder.addFrag(mapq = 100, name = "a03", start = 100, attrs = Map("RX" -> "CACACACA"))
builder.addFrag(mapq = 10, name = "a04", start = 100, attrs = Map("RX" -> "CACACACC"))

val in = builder.toTempFile()
val out = Files.createTempFile("umi_grouped.", ".sam")
val hist = Files.createTempFile("umi_grouped.", ".histogram.txt")
new GroupReadsByUmi(input = in, output = out, familySizeHistogram = Some(hist), rawTag = "RX", assignTag = "MI", strategy = Strategy.Edit, edits = 1, markDuplicates = true).execute()

val recs = readBamRecs(out)
recs.filter(_.name.equals("a01")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a02")).forall(_.duplicate == true) shouldBe true
recs.filter(_.name.equals("a03")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("a04")).forall(_.duplicate == true) shouldBe true
}

it should "correctly group reads with the paired assigner when the two UMIs are the same" in {
val builder = new SamBuilder(readLength=100, sort=Some(SamOrder.Coordinate))
builder.addPair(name="a01", start1=100, start2=300, strand1=Plus, strand2=Minus, attrs=Map("RX" -> "ACT-ACT"))
Expand Down