Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of a SAMRecord comparator that matches samtools' queryname sort order. #1600

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tfenne
Copy link
Member

@tfenne tfenne commented Mar 30, 2022

@nh13, @lbergelson I put this together mostly for fun - it's a comparator that will sort read names the same way samtools does. I'm not really sure what the best way to integrate this is in HTSJDK though. Thoughts?

@codecov-commenter
Copy link

codecov-commenter commented Mar 30, 2022

Codecov Report

Merging #1600 (2260167) into master (a38c78d) will increase coverage by 0.014%.
The diff coverage is 94.118%.

@@               Coverage Diff               @@
##              master     #1600       +/-   ##
===============================================
+ Coverage     69.838%   69.852%   +0.014%     
- Complexity      9653      9664       +11     
===============================================
  Files            703       704        +1     
  Lines          37647     37681       +34     
  Branches        6114      6121        +7     
===============================================
+ Hits           26292     26321       +29     
- Misses          8904      8906        +2     
- Partials        2451      2454        +3     
Impacted Files Coverage Δ
.../samtools/SAMRecordQueryNameNaturalComparator.java 94.118% <94.118%> (ø)
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 80.952% <0.000%> (-9.524%) ⬇️
src/main/java/htsjdk/io/AsyncWriterPool.java 72.222% <0.000%> (-1.389%) ⬇️

@nh13
Copy link
Member

nh13 commented Mar 30, 2022

Always best to be I second place: #1601

@nh13
Copy link
Member

nh13 commented Mar 30, 2022

Do you need to fall back on the first and second of pair flags as samtools does? https://github.com/samtools/samtools/blob/bdc5bb81c11f2ab25ea97351213cb87b33857c4d/bam_sort.c#L1798

@tfenne
Copy link
Member Author

tfenne commented Mar 30, 2022

@nh13 I guess it depends on whether you want to match samtools 100% or implement the "natural" sort order which doesn't prescribe how ties are broken. I don't think I have a strong opinion so long as e.g. first of pair sorts before second of pair, and secondary/supplementary records sort after primary

@tfenne
Copy link
Member Author

tfenne commented Mar 30, 2022

I think the ideal way to make this work, but which would be a PITA to retrofit would be something akin to fgbio's SamOrder, which defines more than the SO tag, and has capabilities for setting SO, SS etc. and for detecting the ordering based on more than the SO tag. But we had the benefit of writing that many years later, and after the sub-sort stuff landed in the spec.

@lbergelson
Copy link
Member

lbergelson commented Mar 30, 2022

Ah, the eternal problem #766. I was assuming that awkwardly working around this was the impetus for the last pr?

I think we have a few options none of which are trivial.
I like what you describe. The subsort field was invented to fix this problem so it makes sense that we would use it. Do you know if samtools is outputting the appropriate SS field in new versions? We never started doing so, but maybe they're more responsible... That would solve the problem going forward, and we could allow users to pick which variant order they want to use when writing new files. It wouldn't solve the problem of reading older sam files.

An alternative would be to try to make some sort of meta queryname sorter which tries to figure out the subsort when it's reading a file. That shouldn't be hard I would think, just try both sort orders and if one passes it's valid. I can't remember if there's any issue with stateful comparators but I think it wouldn't be a problem to have one that figures out which it is and then remembers it.

We should write the SS field in either case...

@tfenne
Copy link
Member Author

tfenne commented Mar 30, 2022

@lbergelson I don't think samtools writes the SS field either for queryname sorting. Though we might be able to get @jmarshall or @nh13 to help with that.

I think the easiest first step would be to have HTSJDK accept either as being queryname sorted - HTSJDK already has that SAMSortOrderChecker which is stateful. I don't think it would be hard to extend that to allow for either/or checking for queryname sorting. And that way we could at least get to the point where HTSJDK (and picard etc.) would accept either samtools or HTSJDK queryname sorting (or, "natural" and "lexicographic") as valid. It's possible that would just fix ValidateSamFile in Picard, or it might require a little more tweaking.

Then the question becomes ... if it can accept either as valid, is there any reason not to switch to the samtools style as the default output for queryname?

@lbergelson
Copy link
Member

The main reason not to switch the default immediately is that a lot of people used mixed pipeline where some of the tools are on older versions. Switching would break them until they update all parts of the pipeline.

Does samtools have trouble reading htsjdk queryname bams? I'm not sure the problem is symmetrical.

@tfenne
Copy link
Member Author

tfenne commented Mar 31, 2022

I don't believe there are issues with samtools reading picard/htsjdk queryname sorted files, but it's not something I do regularly. I think samtools tends more towards trusting/assuming the user provided acceptable input and not validating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants