-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative PR to add duplicate marking to GroupReadsByUmi #940
Merged
Merged
Changes from all commits
Commits
Show all changes
24 commits
Select commit
Hold shift + click to select a range
180b29f
consolodate imports
JoeVieira 596b9eb
consolodate imports
JoeVieira f7b72ab
Logic to calculate scores duplicate scores using htslib duplication d…
JoeVieira 6d62bfc
basic test for duplicate marking
JoeVieira 5bfc4a0
use mark dup flag to gate functionality, reorganize comments
JoeVieira 61bf600
single ended tests
JoeVieira 0cf6ddd
simplify comparison and score calculation into one method, using sort…
JoeVieira b9c075f
comments
JoeVieira 0a702e7
adjustments to tests for more explicit tie breaking.
JoeVieira ccd6882
ensure dups are not marked if flag not passed
JoeVieira 8e49514
remove extra linebreak
JoeVieira 90da16f
optionally include secondary & supplementary reads.
JoeVieira 25e8c5e
doc strings / usage docs
JoeVieira cc9f305
remove extra line break
JoeVieira 02a93ad
formatting
JoeVieira 39be56d
formatting
JoeVieira c79d7bd
clarify duplicate comments
JoeVieira 5a75724
Tidied up implementation of duplicate marking in GroupReadsByUmi.
tfenne f143c95
Cleanup
tfenne cc1cb99
Review fixup
tfenne 126009b
Update src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala
tfenne a637753
Added mapq as part of the scoring of duplicate reads.
tfenne 7161c4a
Update usage docs a little bit.
tfenne 61b46b1
Add codecov token
nh13 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,7 @@ jobs: | |
- name: Code Coverage | ||
uses: codecov/[email protected] | ||
with: | ||
token: ${{ secrets.CODECOV_TOKEN }} | ||
flags: unittests # optional | ||
fail_ci_if_error: true # optional (default = false) | ||
verbose: true # optional (default = false) | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coupling these seems odd to me & deviates from the intended change set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a bit more context of why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e. don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and don't prefilter them) too. What am I missing?
Also, what's the "intended change set"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the tool allows for multiple versions of filtering, allowing flexibility to choose what one is filtering & not seems like it's giving appropriate control over the functionality, rather than forcing a workflow.
It's true that, if one is keeping all reads, one must update all reads, but forcing all reads to stay, when marking doesn't seem like a requirement for that. The forcing to keep all is the odd part from my perspective.
Of course one could filter reads after doing this, if so inclined, but maintaining flexibility, rather than forcing workflows seems like a functional win for the toolset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any option is "forced", it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set
--include-supplementary=true
or--include-suppementary=false
. I think that's why the types on the command line parameters areOption[Boolean]
with defaultNone
:None
-> use a defaultSome(true)
-> include supplementarySome(false)
-> filter out supplementaryThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. I'm being overly dramatic with "forced" & you're right, it can be overidden, you're the maintainers so the paradigm the tool uses is ultimately up to you.
What just strikes me as odd to change defaults when semi related parameters are passed - they aren't strictly coupled parameters, if they were strictly coupled i'd agree - i think this makes for a usage that implies a specific workflow & causes users to have to more closely read the docs if their workflow isn't that one.
As a user, when defaults change because of other parameters it can result in unexpected results - which are often hard to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, the alternative is to provide a second command line tool that performs the duplicate marking that is just a facade over group reads by umi, but that seems a bit heavy handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but would lean to the current implementation as it is well-documented in the usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change because otherwise, if one wants to use
GroupReadsByUmi
as an equivalent toMarkDuplicates
in picard, one would have had to add:--mark-duplicates --min-mapq 0 --include-secondary --include-supplementary
which to me seems error prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly.