Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement keep_intervals method #635

Merged
merged 1 commit into from
Jul 3, 2024
Merged

implement keep_intervals method #635

merged 1 commit into from
Jul 3, 2024

Conversation

bguo068
Copy link
Member

@bguo068 bguo068 commented May 29, 2024

This PR is aimed to address issue #615. A keep_intervals method is implemented for TableCollection and both TreeSequence structs (the latter is a wrapper of the former). The implementation is done in tskit:: rather than tskit::sys:: as the existing bindings seem to be adequate for this.

There are related methods, such as TreeSequence.trim(), that could be nice to add in the future, if this type of PR is interested.

Thanks!

Copy link
Member

@molpopgen molpopgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I've made some review contents based on a first pass. The most important thing is the testing. The tests need to be totally self-contained in rust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a different way to generate the test data. We cannot have dependencies on Python, and we don't want trees files committed to the repository. To make test data, use low-level operations in rust functions that are #[cfg(test)].

Beyond keeping developers sane, we also don't want to run into situations where the Python packages using tskit are using a different version of the C library than tskit-rust. If that were to happen, it becomes possible that we have ABI problems when trying to load the files.

/// Note that no new provenance will be appended.
pub fn keep_intervals(
&mut self,
intervals: impl Iterator<Item = (Position, Position)>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the type of intervals to impl Iterator<Item = (P, P)> where P: Into Position. would improve the ergonomics. If P were Position, the .into() would optimize out.

}

// build mutation_map
let mutation_map = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do this using a map/collect idiom instead of allocating the vector internally?

///
/// Note that no new provenance will be appended.
pub fn keep_intervals(
&mut self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is useful to think through if this fn should modify in place or consume self. There is no reason to follow the conventions of the Python API here. Consider the case of an error: if the function takes & mut self, then it is possible that the object is left in an invalid internal state, meaning that meaningfully continuing is not possible. When that is the case, one could argue that it is better to consume self.

Further, consuming self allows for better use of method chaining.

(In general, I think too much of the rust API works on &mut self and not self.)

Note that you can have the function accept mut self to consume it. IMO, though, it is a bit more intuitive to have it consume self, and then let mut tables = self inside the function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point on error causing invalide interval state! I change it accordingly

@bguo068
Copy link
Member Author

bguo068 commented May 29, 2024

Thanks for the comments. I think they all make sense to me. I thought about the same thing regarding the test data but have not had a good idea of how to generate them manually using rust code. If you have some suggestions, it would be great. If not, I will look into the code base of rust/c versions of tskit for the conventions.

@molpopgen
Copy link
Member

Thanks for the comments. I think they all make sense to me. I thought about the same thing regarding the test data but have not had a good idea of how to generate them manually using rust code. If you have some suggestions, it would be great. If not, I will look into the code base of rust/c versions of tskit for the conventions.

There are two ways to go for building test data:

  1. Manually add rows to a table collection, sort it, convert to tree sequence for tests on that object.
  2. Randomly generate a bunch of nodes/edges/mutations as if you were doing a forward simulation. Then sort. Convert to tree sequence if necessary.

Both are useful and give different ways of stress-testing the machinery.

To make the tests really useful, one would also want a naive implementation of the algorithm defined in a rust test file. Here, performance doesn't matter and we should be able to assert equality and/or equivalence between the two outputs.

/// # Example
/// ```rust
/// use tskit::TreeSequence;
/// let mut tables = TreeSequence::load("./testdata/1.trees")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the doc tests, you can hide setup code. See here. Doing this lets you only show the reader the function call itself. The hidden code can contain assertions, etc., after it is called.

Copy link
Member Author

@bguo068 bguo068 May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good suggestion!
Currently, I am working on a simple simulation method following the tutorial on tskit c api webpage. I will make sure I follow this suggestion for doc test.

@molpopgen
Copy link
Member

@bguo068 what happens if one tries to keep intervals for which there are no data? Imagine a table collection with a sequence length of L, yet no edges end at L. Then you keep edges only for the end where there is no data. While the data model allows empty table collections and tree sequences, one could argue that returning None for this case makes things more obvious?

@bguo068
Copy link
Member Author

bguo068 commented May 31, 2024

@bguo068 what happens if one tries to keep intervals for which there are no data? Imagine a table collection with a sequence length of L, yet no edges end at L. Then you keep edges only for the end where there is no data. While the data model allows empty table collections and tree sequences, one could argue that returning None for this case makes things more obvious?

Yep. I can check the edge table at the end of keep_intervals method and return Ok(None) when edge table is empty.

@bguo068 bguo068 force-pushed the main branch 2 times, most recently from e3c7009 to 2e5d9c5 Compare June 1, 2024 04:49
@bguo068
Copy link
Member Author

bguo068 commented Jun 1, 2024

@molpopgen Many thanks for your guidance. I have added simulation code to generate trees that can be used to validate the keep_intervals methods. I have one question/issue:

To generate trees in both doc tests and test modules under the tests/ folder, it seems necessary to place the simulation code under the src/ folder and declare it in src/lib.rs. Currently, the simulation code resides in src/test_data.rs.

If this organization is not ideal, could you please advise on a better way to structure the code? Or we do not need generate trees for doc tests of keep_intervals?

@bguo068 bguo068 requested a review from molpopgen June 1, 2024 05:16
Copy link
Member

@molpopgen molpopgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This is moving in the right direction. The primary issue here is that testing code has been added to the public API. We cannot have this -- testing code should be entirely internal.

I have not taken a big picture overview of the PR yet. I will do so this week.

Cargo.toml Outdated
@@ -27,6 +27,8 @@ serde_json = {version = "1.0.114", optional = true}
bincode = {version = "1.3.1", optional = true}
tskit-derive = {version = "0.2.0", path = "tskit-derive", optional = true}
delegate = "0.12.0"
rand = "0.8.3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand is already a dev dependency.

src/lib.rs Outdated
@@ -177,4 +177,5 @@ mod tests {
}

// Testing modules
pub mod test_data;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use test_fixtures and do not make testing modules pub -- they are not part of the tskit API.

@@ -32,6 +32,7 @@ use crate::TskReturnValue;
use crate::{EdgeId, NodeId};
use ll_bindings::tsk_id_t;
use ll_bindings::tsk_size_t;
use streaming_iterator::StreamingIterator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a style point, we may want to have this use statement local to the functions that need it. If/when we modernize ourselves away from this dependency, it'll be easier to find.

///
/// # Example
/// ```rust
/// use tskit::TreeSequence;
/// let mut tables = TreeSequence::load("./testdata/1.trees")
/// # use tskit::test_data::simulation::simulate_two_treesequences;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you are going for here. Doc tests only have access to the public api, but testing modules must not be public. Here, you need to be totally self-contained because we won't have a public API to do this.

/// # let popsplit_time = 10;
/// # let seed = 123;

/// # let (full_trees, _exepected) = simulate_two_treesequences(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- we don't want to provide a public API for this.

src/test_data.rs Outdated
@@ -0,0 +1,373 @@
/// mimic the c simulate function in tskit c api document
/// https://tskit.dev/tskit/docs/stable/c-api.html#basic-forwards-simulator
pub mod simulation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move all of this to test_fixtures.rs so that it is not pub outside of tskit.

/// use tskit::TreeSequence;
/// let mut ts = TreeSequence::load("testdata/1.trees").expect("error loading ts");
/// let new_ts = ts.keep_intervals(vec![(10.0.into(), 130.0.into())].into_iter(), true).unwrap();
/// # use tskit::test_data::simulation::simulate_two_treesequences;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here -- this will not be pub.

@bguo068
Copy link
Member Author

bguo068 commented Jun 1, 2024

@molpopgen I have tried to fix issues related to test code being public as api. Hope now its better. If there is anything else to be improved, please comment. Thanks.

@molpopgen
Copy link
Member

@molpopgen I have tried to fix issues related to test code being public as api. Hope now its better. If there is anything else to be improved, please comment. Thanks.

Thanks! I'll take a look later this week.

@molpopgen
Copy link
Member

The next step on my end will be to pull this branch and play around a bit. I will try to do that in the latter half of the coming week.

for intervals in intervals_lst {
// an empty tablecollection is enough here
let mut tables = TableCollection::new(100.0).unwrap();
tables.build_index().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a proper test? By starting with an empty tree sequence, there is no chance of, say, lifting over the same edge 2x? I would have expected the index building and/or the validate method to set an error in some cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep_intervals checks the validity of the intervals before it modifying table collection. I thought adding a few edges and nodes would not change the outcome of the tests. But I can add a few edges and nodes in case of some tskit errors that I am not aware of.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Probably no need for an extra test then. Once I'm back to work later this week, I'll take a deeper dive.

I may add a second test that uses proptest to generate random data. Those tests are nice for operations like this.

Cargo.toml Outdated Show resolved Hide resolved
.lending_iter()
.filter(|mrow| !!((mrow.right <= s) || (mrow.left >= e)));

while let Some(migration_row) = migration_iter.next() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests do not cover this block. It may be useful to come up with something to cover this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left migration table recording out of simulation as it does not work well with simplification as we can see from tskit C api docs for tsk_table_collection_simplify (although the keep_intervals python code deal with migration table):

Note
Migrations are currently not supported by simplify, and an error will be raised if we attempt call simplify on a table collection with greater than zero migrations. See tskit-dev/tskit#20

I can add a note to the Rust keep_intervals docs to reflect this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. One could envision a test of manually-generated data that don't require sorting. But the reality is that that block of code is adding data already present in the tables. The only way for there to be an error is if someone loaded a table collection with invalid row data generated by another tool.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I will make a test that if some one try to call keep_intervals on treeseq that has nonempty migration and set simplify=true in the argument, it should return a tskit error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before implementing a manual check, just do the test. tskit-c will, I believe, set an error code for that case, so you can just ? on the simplify call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before implementing a manual check, just do the test. tskit-c will, I believe, set an error code for that case, so you can just ? on the simplify call.

Sorry I am not sure I understand your suggestion. Did you mean to do test on the method that generates a treesequence with non-empty migration table in src/test_fixtures.rs generate_simple_treesequence function ?

I thought I already used ? for the simplfy call in the keep_intervals method for TableCollection. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I misunderstood your comment re: testing. I think that your tests are actually okay except for the issue of adding the migrations after simplification.

@bguo068 bguo068 force-pushed the main branch 3 times, most recently from e93ff0a to 63fff51 Compare June 7, 2024 02:21
tables.full_sort(TableSortOptions::all()).unwrap();
tables.simplify(&[child1, child2], sim_opts, false).unwrap();

// add migration records after simplification to avoid errors when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm -- this could be the wrong thing to do?? I worry here that what is being added are unsimplified migrations to a simplified table? Further, you are not taking advantage of the output node ID map from simplification to do things like remap the child node IDs.

I think it is simpler and more correct for now to simply do this before simplification and let tskit-c raise an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm -- this could be the wrong thing to do?? I worry here that what is being added are unsimplified migrations to a simplified table? Further, you are not taking advantage of the output node ID map from simplification to do things like remap the child node IDs.

I think it is simpler and more correct for now to simply do this before simplification and let tskit-c raise an error.

I know it is not the best way to get some migration records in a treesequence but I just want to add something on the migration table to trigger error for calling simplify within the implementation of the keep_interval method.

If I add the migrtation records before simplifcation and let tskit c api raise an error, I wont be able to get a treesequence that has non-empty migration table that is later used to trigger error when calling keep_intervals.

Maybe you want me to remove the whole test_keep_intervals_nonempty_migration_table tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two problems that I see:

  1. The migration spans are incorrect and we cannot validate/check that.
  2. The node ids for the migrations are incorrect.

So if we want to keep this test (and I think we probably do), then you should at least remap the node ids to be correct using the output off simplify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I just updated the code. Hopefully it is better now.


// add migration records after simplification to avoid errors when
// simplifying a treesequence that contains a nonempty migration table
if add_migration_records {
let pop_anc = tables.add_population().unwrap();
let pop_1 = tables.add_population().unwrap();
let pop_2 = tables.add_population().unwrap();
// get new ids after simplifcation
let child1 = id_map[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be preferable here to replace 0 with child1.as_usize(). Same for [1] in the following line.

Copy link
Member Author

@bguo068 bguo068 Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that! I was thinking that the return idmap has the same length as samples: &[NodeId]. It actually has the length of self.nodes().num_rows() before the tablecollection or treesequence is modified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right -- the behavior allows you to update node-id-based data structures that are not managed by a table collection. This is very useful in practice.

I'm about to go camping and then on holiday. More when I return.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have fun!🤩

@molpopgen
Copy link
Member

@bguo068 I think that this is almost ready. Can you please do the following:

  1. rebase your PR against the current main branch.
  2. Squash your commits into a single commit using conventional commit syntax. For example: feat: added keep intervals for tables and tree sequences. You can see how other commits look via git log. We use this syntax to automate changelog generation.

@bguo068
Copy link
Member Author

bguo068 commented Jul 2, 2024

@bguo068 I think that this is almost ready. Can you please do the following:

  1. rebase your PR against the current main branch.
  2. Squash your commits into a single commit using conventional commit syntax. For example: feat: added keep intervals for tables and tree sequences. You can see how other commits look via git log. We use this syntax to automate changelog generation.

@molpopgen Thanks for the instructions. I've rebased my PR against the current main branch and squashed the commits into a single commit using the conventional commit syntax. Please let me know if any further adjustments are needed.

@molpopgen
Copy link
Member

Thanks! I'll take one last look to check for any "red flags", but I do think that this can be merged soon.

@molpopgen
Copy link
Member

I think there is one thing missing:

  • If the tables are simplified, we don't have a means of returning the node id map.
  • Without this map, some information important to client code can be lost.

I don't think that this is a problem for this PR, though. I'll try to work through the semantics in a later PR and before the next stable release.

@molpopgen
Copy link
Member

@bguo068 I think that the solution to the simplification issue will be to separate trimming to intervals from simplification. Given that rust is a low level language, functions managing multiple operations like "retain AND simplify IF a user so desires" may not be the way to go.

I'll puzzle this out in a downstream PR like I said.

@molpopgen
Copy link
Member

Going to merge this now. Thanks @bguo068 !!

@molpopgen molpopgen merged commit cb37378 into tskit-dev:main Jul 3, 2024
11 checks passed
@bguo068
Copy link
Member Author

bguo068 commented Jul 3, 2024

@bguo068 I think that the solution to the simplification issue will be to separate trimming to intervals from simplification. Given that rust is a low level language, functions managing multiple operations like "retain AND simplify IF a user so desires" may not be the way to go.

I'll puzzle this out in a downstream PR like I said.

@bguo068 I think that the solution to the simplification issue will be to separate trimming to intervals from simplification. Given that rust is a low level language, functions managing multiple operations like "retain AND simplify IF a user so desires" may not be the way to go.

I'll puzzle this out in a downstream PR like I said.

@molpopgen Thank you for all your guidance and support in making this happen. I look forward to contributing more features to this project in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants