-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative PR to add duplicate marking to GroupReadsByUmi #940
Changes from 19 commits
180b29f
596b9eb
f7b72ab
6d62bfc
5bfc4a0
61bf600
0cf6ddd
b9c075f
0a702e7
ccd6882
8e49514
90da16f
25e8c5e
cc9f305
02a93ad
39be56d
c79d7bd
5a75724
f143c95
cc1cb99
126009b
a637753
7161c4a
61b46b1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -26,11 +26,10 @@ | |||||
*/ | ||||||
package com.fulcrumgenomics.umi | ||||||
|
||||||
import java.util.concurrent.atomic.AtomicLong | ||||||
import com.fulcrumgenomics.FgBioDef._ | ||||||
import com.fulcrumgenomics.bam.api.SamOrder.TemplateCoordinate | ||||||
import com.fulcrumgenomics.bam.{Bams, Template} | ||||||
import com.fulcrumgenomics.bam.api.{SamOrder, SamRecord, SamSource, SamWriter} | ||||||
import com.fulcrumgenomics.bam.{Bams, Template} | ||||||
import com.fulcrumgenomics.cmdline.{ClpGroups, FgBioTool} | ||||||
import com.fulcrumgenomics.commons.util.{LazyLogging, NumericCounter, SimpleCounter} | ||||||
import com.fulcrumgenomics.sopt.{arg, clp} | ||||||
|
@@ -39,13 +38,14 @@ import com.fulcrumgenomics.util.Metric.{Count, Proportion} | |||||
import com.fulcrumgenomics.util.Sequences.countMismatches | ||||||
import com.fulcrumgenomics.util._ | ||||||
import enumeratum.EnumEntry | ||||||
import htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy | ||||||
import htsjdk.samtools._ | ||||||
import htsjdk.samtools.util.SequenceUtil | ||||||
|
||||||
import scala.collection.BufferedIterator | ||||||
import java.util.concurrent.atomic.AtomicLong | ||||||
import scala.collection.immutable.IndexedSeq | ||||||
import scala.collection.mutable | ||||||
import scala.collection.mutable.ListBuffer | ||||||
import scala.collection.{BufferedIterator, Iterator, mutable} | ||||||
|
||||||
|
||||||
object GroupReadsByUmi { | ||||||
|
@@ -439,9 +439,13 @@ object Strategy extends FgBioEnum[Strategy] { | |||||
| 3. The assigned UMI tag | ||||||
| 4. Read Name | ||||||
| | ||||||
|If the input is not template-coordinate sorted (i.e. `SO:unsorted GO:query SS:unsorted:template-coordinate`), then | ||||||
|this tool will re-sort the input. The output will be written in template-coordinate order. | ||||||
| | ||||||
|During grouping, reads are filtered out if a) all reads with the same queryname are unmapped, b) any primary | ||||||
|read has mapping quality < `min-map-q` (default=1), or c) the primary mappings for R1 and R2 are on different | ||||||
|chromosomes and `--allow-inter-contig` has been set to false. | ||||||
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different | ||||||
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered, | ||||||
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't grok 446-447 after multiple attempts at reading the usage (before looking at the code). Perhaps you don't mean "non-primary"? And if not, then I'm really confused! |
||||||
| | ||||||
|Grouping of UMIs is performed by one of four strategies: | ||||||
| | ||||||
|
@@ -470,27 +474,38 @@ object Strategy extends FgBioEnum[Strategy] { | |||||
|(i.e. does not count dashes and other non-ACGT characters). This option is not implemented for reads with UMI pairs | ||||||
|(i.e. using the paired assigner). | ||||||
| | ||||||
|If the input is not template-coordinate sorted (i.e. `SO:unsorted GO:query SS:unsorted:template-coordinate`), then | ||||||
|this tool will re-sort the input. The ouitput will be written in template-coordinate order. | ||||||
|If the `--mark-duplicates` option is given, reads will also have their duplicate flag set in the BAM file. | ||||||
|Each tag-family is treated separately, and a single template within the tag family is chosen to be the "unique" | ||||||
|template and marked as non-duplicate, while all other templates in the tag family are then marked as duplicate. | ||||||
| | ||||||
|Several parameters have different defaults depending on whether duplicates are being marked or not (all are | ||||||
|directly settable on the command line): | ||||||
| | ||||||
| 1. `--min-map-q` defaults to 0 in duplicate marking mode and 1 otherwise | ||||||
| 2. `--include-secondary` defaults to true in duplicate marking mode and false otherwise | ||||||
| 3. `--include-suppementary` defaults to true in duplicate marking mode and false otherwise | ||||||
tfenne marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
""" | ||||||
) | ||||||
class GroupReadsByUmi | ||||||
( @arg(flag='i', doc="The input BAM file.") val input: PathToBam = Io.StdIn, | ||||||
@arg(flag='o', doc="The output BAM file.") val output: PathToBam = Io.StdOut, | ||||||
@arg(flag='f', doc="Optional output of tag family size counts.") val familySizeHistogram: Option[FilePath] = None, | ||||||
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX", | ||||||
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI", | ||||||
@arg(flag='m', doc="Minimum mapping quality for mapped reads.") val minMapQ: Int = 1, | ||||||
@arg(flag='n', doc="Include non-PF reads.") val includeNonPfReads: Boolean = false, | ||||||
@arg(flag='s', doc="The UMI assignment strategy.") val strategy: Strategy, | ||||||
@arg(flag='e', doc="The allowable number of edits between UMIs.") val edits: Int = 1, | ||||||
@arg(flag='l', doc= """The minimum UMI length. If not specified then all UMIs must have the same length, | ||||||
|otherwise discard reads with UMIs shorter than this length and allow for differing UMI lengths. | ||||||
|""") | ||||||
val minUmiLength: Option[Int] = None, | ||||||
@arg(flag='x', doc= """ | ||||||
|DEPRECATED: this option will be removed in future versions and inter-contig reads will be | ||||||
|automatically processed.""") | ||||||
(@arg(flag='i', doc="The input BAM file.") val input: PathToBam = Io.StdIn, | ||||||
@arg(flag='o', doc="The output BAM file.") val output: PathToBam = Io.StdOut, | ||||||
@arg(flag='f', doc="Optional output of tag family size counts.") val familySizeHistogram: Option[FilePath] = None, | ||||||
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX", | ||||||
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shall we finally update these to use
|
||||||
@arg(flag='d', doc="Turn on duplicate marking mode.") val markDuplicates: Boolean = false, | ||||||
@arg(flag='S', doc="Include secondary reads.") val includeSecondary: Option[Boolean] = None, | ||||||
@arg(flag='U', doc="Include supplementary reads.") val includeSupplementary: Option[Boolean] = None, | ||||||
@arg(flag='m', doc="Minimum mapping quality for mapped reads.") val minMapQ: Option[Int] = None, | ||||||
@arg(flag='n', doc="Include non-PF reads.") val includeNonPfReads: Boolean = false, | ||||||
@arg(flag='s', doc="The UMI assignment strategy.") val strategy: Strategy, | ||||||
@arg(flag='e', doc="The allowable number of edits between UMIs.") val edits: Int = 1, | ||||||
@arg(flag='l', doc= """The minimum UMI length. If not specified then all UMIs must have the same length, | ||||||
|otherwise discard reads with UMIs shorter than this length and allow for differing UMI lengths. | ||||||
|""") | ||||||
val minUmiLength: Option[Int] = None, | ||||||
@arg(flag='x', doc= """ | ||||||
|DEPRECATED: this option will be removed in future versions and inter-contig reads will be | ||||||
|automatically processed.""") | ||||||
@deprecated val allowInterContig: Boolean = true | ||||||
)extends FgBioTool with LazyLogging { | ||||||
import GroupReadsByUmi._ | ||||||
|
@@ -499,6 +514,10 @@ class GroupReadsByUmi | |||||
|
||||||
private val assigner = strategy.newStrategy(this.edits) | ||||||
|
||||||
// Give values to unset parameters that are different in duplicate marking mode | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Coupling these seems odd to me & deviates from the intended change set. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you provide a bit more context of why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e. don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and don't prefilter them) too. What am I missing? Also, what's the "intended change set"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that the tool allows for multiple versions of filtering, allowing flexibility to choose what one is filtering & not seems like it's giving appropriate control over the functionality, rather than forcing a workflow. It's true that, if one is keeping all reads, one must update all reads, but forcing all reads to stay, when marking doesn't seem like a requirement for that. The forcing to keep all is the odd part from my perspective. Of course one could filter reads after doing this, if so inclined, but maintaining flexibility, rather than forcing workflows seems like a functional win for the toolset. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think any option is "forced", it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair. I'm being overly dramatic with "forced" & you're right, it can be overidden, you're the maintainers so the paradigm the tool uses is ultimately up to you. What just strikes me as odd to change defaults when semi related parameters are passed - they aren't strictly coupled parameters, if they were strictly coupled i'd agree - i think this makes for a usage that implies a specific workflow & causes users to have to more closely read the docs if their workflow isn't that one. As a user, when defaults change because of other parameters it can result in unexpected results - which are often hard to debug. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair, the alternative is to provide a second command line tool that performs the duplicate marking that is just a facade over group reads by umi, but that seems a bit heavy handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but would lean to the current implementation as it is well-documented in the usage. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I made this change because otherwise, if one wants to use
which to me seems error prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly. |
||||||
private val _minMapQ = this.minMapQ.getOrElse(if (this.markDuplicates) 0 else 1) | ||||||
private val _includeSecondaryReads = this.includeSecondary.getOrElse(this.markDuplicates) | ||||||
private val _includeSupplementaryReads = this.includeSupplementary.getOrElse(this.markDuplicates) | ||||||
|
||||||
/** Checks that the read's mapq is over a minimum, and if the read is paired, that the mate mapq is also over the min. */ | ||||||
private def mapqOk(rec: SamRecord, minMapQ: Int): Boolean = { | ||||||
|
@@ -546,11 +565,12 @@ class GroupReadsByUmi | |||||
// Filter and sort the input BAM file | ||||||
logger.info("Filtering the input.") | ||||||
val filteredIterator = in.iterator | ||||||
.filter(r => !r.secondary && !r.supplementary) | ||||||
.filter(r => this._includeSecondaryReads || !r.secondary ) | ||||||
.filter(r => this._includeSupplementaryReads || !r.supplementary ) | ||||||
.filter(r => (includeNonPfReads || r.pf) || { filteredNonPf += 1; false }) | ||||||
.filter(r => (r.mapped || (r.paired && r.mateMapped)) || { filteredPoorAlignment += 1; false }) | ||||||
.filter(r => (allowInterContig || r.unpaired || r.refIndex == r.mateRefIndex) || { filteredPoorAlignment += 1; false }) | ||||||
.filter(r => mapqOk(r, this.minMapQ) || { filteredPoorAlignment += 1; false }) | ||||||
.filter(r => mapqOk(r, this._minMapQ) || { filteredPoorAlignment += 1; false }) | ||||||
.filter(r => !r.get[String](rawTag).exists(_.contains('N')) || { filteredNsInUmi += 1; false }) | ||||||
.filter { r => | ||||||
this.minUmiLength.forall { l => | ||||||
|
@@ -610,10 +630,15 @@ class GroupReadsByUmi | |||||
// Take the next set of templates by position and assign UMIs | ||||||
val templates = takeNextGroup(templateCoordinateIterator, canTakeNextGroupByUmi=canTakeNextGroupByUmi) | ||||||
assignUmiGroups(templates) | ||||||
|
||||||
// Then output the records in the right order (assigned tag, read name, r1, r2) | ||||||
// Then group the records in the right order (assigned tag, read name, r1, r2) | ||||||
val templatesByMi = templates.groupBy { t => t.r1.get.apply[String](this.assignTag) } | ||||||
|
||||||
// If marking duplicates, assign bitflag to all duplicate reads | ||||||
if (this.markDuplicates) { | ||||||
templatesByMi.values.foreach(t => setDuplicateFlags(t)) | ||||||
} | ||||||
|
||||||
// Then output the records in the right order (assigned tag, read name, r1, r2) | ||||||
templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => { | ||||||
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => { | ||||||
out += rec | ||||||
|
@@ -731,4 +756,18 @@ class GroupReadsByUmi | |||||
r1[String](this.rawTag) | ||||||
} | ||||||
} | ||||||
|
||||||
/** Sets the duplicate flags on all reads within all templates. */ | ||||||
private def setDuplicateFlags(group: Seq[Template]): Unit = { | ||||||
val nonDuplicateTemplate = group.minBy { template => | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want For example, if the duplicate score is the sum of base qualities, then don't we want the template with the highest sum of base qualities? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good catch! |
||||||
template.primaryReads.sumBy { r => | ||||||
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As a compromise, consider setting the ScoringStrategy as a constructor parameter that's not exposed on the command line? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I really don't want to expose this anywhere - I want the freedom to change this if/when we decide using HTSJDK's duplicate scoring stuff was a bad idea in the future, without breaking any APIs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A few things in here - this is more tidy logic, nice. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just as a point re duplicate scoring strategies - the core strategies for choosing the representative read, much like UMI clustering algorithms are fairly established in literature & while having others seems possible, not supporting established ones, seems like a limitation which would be self imposed, I'm happy to wrap these enums if that's a need from the projects perspective so that the HTSJDK api isn't leaking. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Lol, this did make me laugh a little. I would agree if deep thought had gone into the original duplicate scoring strategies ... but as the author of that original code (in Picard that moved into HTSJDK) I can tell you that it was me making it up on the fly and coming up with something vaguely sensible! Over time I tend to think that the duplicate marking stuff in Picard has gotten far too complicated. I'd also argue that there's little to no value tie-breaking on mapq. If the reads are considered duplicates of each other and have either a) the same sum of base qualities > Q15, or b) the same mapped reference length, there's a very high probability they also have the same sum of mapqs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One options seems fine. I think it's perfectly fine that that is sum of base q, nothing about picards implementation of these scoring strategies is complex, the usages deep in the code get to be rather complex, but this simple score calculator just seems to be good code reuse to me. I'm happy to look at how map q changes of note my testing is with bowtie2 mostly - so that certainly is impacting my perspective here. I'll stop pushing here & just test a few more things, give some perspective with data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @JoeVieira ... please let me know when you've had a chance to look at your data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tfenne it'll be a little bit before I'm able to dig back into that data. But will in the next week or so. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be sensible to choose the read with highest mapq and breaking ties randomly. @JoeVieira did you have any suggestions after reviewing some data? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry all, I have no been able to review data at scale yet. It's on my list for next week i promise. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
} | ||||||
} | ||||||
|
||||||
group.foreach { template => | ||||||
val flag = !(template eq nonDuplicateTemplate) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
template.allReads.foreach(_.duplicate = flag) | ||||||
} | ||||||
} | ||||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.