Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Piscem-cpp #47

Merged
merged 6 commits into from
Jul 26, 2024
Merged

Conversation

NPSDC
Copy link

@NPSDC NPSDC commented Jul 26, 2024

The above PR is w.r.t efforts towards resolving the GitHub issue COMBINE-lab/piscem-cpp#2 aka using SSHash as a submodule in piscem-cpp.

Piscem-cpp takes the reduced cf format from cuttlefish as input. This PR now supports taking the above format as input for building the index. In addition, checker functions have also been updated to enable parsing the cfseg file.

We would also like to make feature requests in SSHash to enable its integration as a submodule in piscem.

  1. Allow the build command to accept a thread parameter - currently, the default behaviour (in the Master branch of SShash) is to use all available threads.
  2. Add another flag quiet, that turns on/off logging when building the index.

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Hi Noor,
fantastic! Thank you very much for this PR.
I have set up some time ago a branch called "submodule-integration-for-piscem" and I think all the piscem related stuff should go there. (Th current master branch is waiting for another PR to be merged.)

I've just brought the branch "submodule-integration-for-piscem" up to date wrt to the master branch.

Would you close this PR for the master branch and open it for the branch "submodule-integration-for-piscem"?

thank you!
-Giulio

@NPSDC NPSDC changed the base branch from master to submodule-integration-for-piscem July 26, 2024 13:23
@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

Hi Giulio,
Thank you so much for the prompt response. I have modified the PR for merging into "submodule-integration-for-piscem"

@jermp jermp merged commit 51169e0 into jermp:submodule-integration-for-piscem Jul 26, 2024
2 checks passed
@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Merged! Thanks Noor. All future PRs should be reflected in that branch. I think that's the best way to proceed.
-Giulio

@rob-p
Copy link
Contributor

rob-p commented Jul 26, 2024

Perfect; thanks so much for the quick merge @jermp! As we discussed in the piscem repo and as Noor points out here, the other minimal changes I think we will need are the ability to control the construction thread count and the ability to disable logging during construction (since we decided not to pull the heavy spdlog dependency into sshash).

One other question I had is that I noticed that many of the util functions for e.g. extracting k-mers have been updated from the (now long outdated) sshash version we vendor with piscem-cpp. When k is small (<=31) what are the differences between the new kmer type and the previous representation which was just a literal uint64_t?

Thanks!
Rob

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Hi @rob-p,
good that we are not going to ship with piscem the spdlog. For the quiet build, it should be pretty straightforward: I'd even alter the current role of --verbose to make it control this aspect. Currently, --verbose prints some extra info but we can directly make it work like it outputs the current logging. (So, if not provided, it makes everything quiet.)

Regarding your question about the kmer type, the kmer.hpp just defines kmer_t as uint64_t for "small" k. Hence, there is no difference. Or are you referring to something else / have a specific function in mind?

-Giulio

@rob-p
Copy link
Contributor

rob-p commented Jul 26, 2024

@NPSDC, can you explain the difference we tracked down when implementing cuttlefish parsing in the latest sshash where we got the same number of kmers but different number of super-kmers when using the old utility function?

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like SSHASH_USE_MAX_KMER_LENGTH_63 and SSHASH_USE_TRADITIONAL_NUCLEOTIDE_ENCODING. But under the same config, the utility functions should work exactly the same.

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

So, if you look at this commit under parse_file.hpp e802c97, only two functions were changed and we saw that the total number of super-kmers and corresponding downstream changes in the size of the partitions were observed. Using the uint_kmer instead of uint64 seemed to have resolved the differences.

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like SSHASH_USE_MAX_KMER_LENGTH_63 and SSHASH_USE_TRADITIONAL_NUCLEOTIDE_ENCODING. But under the same config, the utility functions should work exactly the same.

if canonical parsing is on, then the number of super-kmers should not change, right?

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

So, if you look at this commit under parse_file.hpp e802c97, only two functions were changed and we saw that the total number of super-kmers and corresponding downstream changes in the size of the partitions were observed. Using the uint_kmer instead of uint64 seemed to have resolved the differences.

Ok, so now: does it work correctly?

if canonical parsing is on, then the number of super-kmers should not change, right?

It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently.

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

Ok, so now: does it work correctly?

Yes, it does work correctly now (like no change in output index, when fasta or the cfseg file is used.)

It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently.

If both files are parsed with canonical-parsing on, then it is the canonical-kmer that would be considered and so the minimizer should not change. Please correct me if I am wrong in my understanding.

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Yes, it does work correctly now (like no change in output index, when fasta or the cfseg file is used.

Ok, perfect!

If both files are parsed with canonical-parsing on, then it is the canonical-kmer that would be considered and so the minimizer should not change. Please correct me if I am wrong in my understanding.

Oh sure, if both builds use the flag --canonical-parsing, then the number of super-kmers must be the same.
Before, I was referring to "regular" parsing vs. "canonical" parsing. (The canonical one introduces more minimizers and hence, more super-kmers.)

@rob-p
Copy link
Contributor

rob-p commented Jul 26, 2024

Right so the only changes were:

uint64_t uint64_kmer = util::string_to_uint64_no_reverse(kmer, k);

to

kmer_t uint_kmer = util::string_to_uint_kmer(kmer, k);

and

uint64_kmer_rc = util::compute_reverse_complement(uint64_kmer, k);
uint64_t minimizer_rc = util::compute_minimizer(uint64_kmer_rc, k, m, seed);

to

kmer_t uint_kmer_rc = util::compute_reverse_complement(uint_kmer, k);
uint64_t minimizer_rc = util::compute_minimizer(uint_kmer_rc, k, m, seed);

This changed the number of observed super k-mers in the parse.

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Strange. I think that should not happen...
I bet the difference is small at least, right?

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

It was small but not that small (few 100s) on chr13 genome.

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

So more super-kmers in the new version compared to the old one used in piscem-cpp repo? 🤔

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

Pasting relevant pieces of log comparing the two in the initial version before the fix as mentioned by @rob-p

reading file '../../../DNA_datasets/chr13_f1.fa'...
read 740305 sequences, 113120928 bases, 90911778 kmers
num_kmers 90911778
num_super_kmers 13544993
num_pieces 740306
k = 31, m = 15, seed = 1, l = 6, c = 3, canonical_parsing = true, weighted = false, file type = cfseg
reading file '../../../DNA_datasets/chr13.cf_seg'...
m_buffer_size 29411764
sorting buffer...
saving to file './sshash.tmp.run_1722023687779820410.minimizers.0.bin'...
read 0 sequences, 113120928 bases, 90911778 kmers
num_kmers 90911778
num_super_kmers 13509768
num_pieces 740306 (+0.488588 [bits/kmer])

@jermp
Copy link
Owner

jermp commented Jul 26, 2024

Sorry but I'm confused. Are you both referring to different numbers between different file formats (.fa vs. .cf_seg) or, given a file, both implementation return the same numbers?

@NPSDC
Copy link
Author

NPSDC commented Jul 26, 2024

We are referring to different numbers for the different file formats. The numbers become consistent after the changes mentioned by @rob-p.

@rob-p
Copy link
Contributor

rob-p commented Jul 26, 2024

Yes, the string sets in the 2 files are the same. The only difference is the parsing function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants