Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement truncation of tree sequences to specified genome intervals #615

Closed
bguo068 opened this issue Apr 30, 2024 · 8 comments
Closed
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@bguo068
Copy link
Member

bguo068 commented Apr 30, 2024

The recent addition of a method allowing the tree iterator to start at a specific position (#435) is a significant improvement. However, there are scenarios where it would be beneficial to trim or truncate the tree sequence to specific genome intervals. This functionality would (1) simplify the handling of tree sequences by reducing their size, and (2) enable users to focus on particular regions of interest, excluding irrelevant areas.

Currently, a Python API for this feature exists within tskit (tskit.TreeSequence.keep_intervals), but there is no equivalent C API or Rust binding available. The implementation could potentially be adapted from the existing Python method as outlined here: tskit TableCollection.keep_intervals.

Implementing a keep_intervals method for the TreeSequence or TableCollection structures would greatly enhance functionality and user flexibility in managing tree sequences.

@molpopgen molpopgen added enhancement New feature or request good first issue Good for newcomers labels Apr 30, 2024
@molpopgen
Copy link
Member

Thanks @bguo068. Do you have a time frame in mind for needing such a feature? I'm trying to prioritize/triage things for the next release.

@bguo068
Copy link
Member Author

bguo068 commented Apr 30, 2024

Thank you for your impressive work on the development and maintenance of this repository. I am currently enhancing my skills to contribute more effectively to Rust projects. I would like to prepare a pull request and aim to submit it within the next month. Could you please suggest specific areas of this codebase or provide some initial steps where I should start? I would greatly appreciate your guidance as I begin this process.

@molpopgen
Copy link
Member

In terms of design:

  • We only need this method for a table collection. For a tree sequence: we can dump tables w/out a copy, apply "keep" intervals, and then convert back to a tree sequence. Idiomatically, rust's method chaining lets us do all of this without repeating methods in multiple places.
  • The method should consume the table collection.

In terms of implementation:

  • Ideally, the work would happen in the sys module first. We'd add the method to the tree sequence struct therein, handling any unsafe there. These types are only pub at the crate level. The final public API from the tskit::TreeSequence would dispatch out the inner function.
  • I want to do a bit of type refactoring, pushing the newtype implementations from the main crate into sys. Doing so would enable sys to be implemented without having do deal with the problems that arise to due tsk_id_t, etc., being easy to misuse at the C level. I'll do this rather soon, I think.
  • The Python implementation is only a partial guide. It relies on numpy to provide high-level operations. A rust implementation would be lower-level but the basic ideas are the same. Handling metadata would be the trickiest thing, I think.

@bguo068
Copy link
Member Author

bguo068 commented Apr 30, 2024

Thanks for the guidance. I will follow these points carefully in preparation for the PR.
I have not used metadata before, so it could be nice to learn during the process.

@molpopgen
Copy link
Member

Thanks for the guidance. I will follow these points carefully in preparation for the PR. I have not used metadata before, so it could be nice to learn during the process.

In C, the metadata are represented using char *, which is rust's c_char. That is often a signed integer type. However, we have to interact with this using unsigned integer types due to how rust represents strings. The difference requires pointer casts, etc.. At the low level sys module, we want to handle everything via &[u8], and there is no need to deal with the higher level metadata API directly.

@molpopgen
Copy link
Member

@bguo068 I just released an alpha of 0.15 to crates.io. There are some breaking changes in this release, but I expect that most people won't notice them. The docs are up on docs.rs but you have to manually select the new version because it is a pre-release and docs.rs shows the latest stable release by default.

@bguo068
Copy link
Member Author

bguo068 commented May 8, 2024

It will be easier to access the new features and docs. Thanks for the alpha release.

@molpopgen
Copy link
Member

Closed via #635

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants