-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Piscem-cpp #47
Piscem-cpp #47
Conversation
Hi Noor, I've just brought the branch "submodule-integration-for-piscem" up to date wrt to the master branch. Would you close this PR for the master branch and open it for the branch "submodule-integration-for-piscem"? thank you! |
Hi Giulio, |
51169e0
into
jermp:submodule-integration-for-piscem
Merged! Thanks Noor. All future PRs should be reflected in that branch. I think that's the best way to proceed. |
Perfect; thanks so much for the quick merge @jermp! As we discussed in the piscem repo and as Noor points out here, the other minimal changes I think we will need are the ability to control the construction thread count and the ability to disable logging during construction (since we decided not to pull the heavy spdlog dependency into sshash). One other question I had is that I noticed that many of the Thanks! |
Hi @rob-p, Regarding your question about the kmer type, the -Giulio |
@NPSDC, can you explain the difference we tracked down when implementing cuttlefish parsing in the latest sshash where we got the same number of kmers but different number of super-kmers when using the old utility function? |
Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like |
So, if you look at this commit under |
if canonical parsing is on, then the number of super-kmers should not change, right? |
Ok, so now: does it work correctly?
It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently. |
Yes, it does work correctly now (like no change in output index, when
If both files are parsed with |
Ok, perfect!
Oh sure, if both builds use the flag |
Right so the only changes were:
to
and
to
This changed the number of observed super k-mers in the parse. |
Strange. I think that should not happen... |
It was small but not that small (few 100s) on chr13 genome. |
So more super-kmers in the new version compared to the old one used in piscem-cpp repo? 🤔 |
Pasting relevant pieces of log comparing the two in the initial version before the fix as mentioned by @rob-p
|
Sorry but I'm confused. Are you both referring to different numbers between different file formats ( |
We are referring to different numbers for the different file formats. The numbers become consistent after the changes mentioned by @rob-p. |
Yes, the string sets in the 2 files are the same. The only difference is the parsing function. |
The above PR is w.r.t efforts towards resolving the GitHub issue COMBINE-lab/piscem-cpp#2 aka using SSHash as a submodule in piscem-cpp.
Piscem-cpp takes the reduced
cf
format from cuttlefish as input. This PR now supports taking the above format as input for building the index. In addition, checker functions have also been updated to enable parsing thecfseg
file.We would also like to make feature requests in SSHash to enable its integration as a submodule in piscem.