implement keep_intervals method #635

bguo068 · 2024-05-29T04:46:36Z

This PR aims to address issue #615. A keep_intervals method is implemented for both the TableCollection and TreeSequence structs (the latter is a wrapper of the former). The implementation is done in tskit:: rather than tskit::sys::, as the existing bindings seem to be adequate for this.

There are related methods, such as TreeSequence.trim(), that could be nice to add in the future, if this type of PR is of interest.

Thanks!

molpopgen

Thanks for this! I've made some review comments based on a first pass. The most important thing is the testing. The tests need to be totally self-contained in rust.

molpopgen · 2024-05-29T14:34:47Z

testdata/gen_trees.py

We need a different way to generate the test data. We cannot have dependencies on Python, and we don't want trees files committed to the repository. To make test data, use low-level operations in rust functions that are #[cfg(test)].

Beyond keeping developers sane, we also don't want to run into situations where the Python packages using tskit are using a different version of the C library than tskit-rust. If that were to happen, it becomes possible that we have ABI problems when trying to load the files.

molpopgen · 2024-05-29T15:09:43Z

src/table_collection.rs

+    /// Note that no new provenance will be appended.
+    pub fn keep_intervals(
+        &mut self,
+        intervals: impl Iterator<Item = (Position, Position)>,


Changing the type of intervals to impl Iterator<Item = (P, P)> where P: Into<Position> would improve the ergonomics. If P were Position, the .into() would optimize out.
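
A minimal sketch of the suggested signature, for illustration only. The Position type below is a stand-in defined locally (it is not the tskit type), and collect_intervals is a hypothetical helper name; the point is only that callers can pass either f64 pairs or Position pairs.

```rust
/// Stand-in for tskit's `Position` newtype (hypothetical, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct Position(pub f64);

impl From<f64> for Position {
    fn from(value: f64) -> Self {
        Position(value)
    }
}

/// Accepts intervals whose endpoints are any type convertible into `Position`.
/// `collect_intervals` is a hypothetical name, not part of the tskit API.
pub fn collect_intervals<P, I>(intervals: I) -> Vec<(Position, Position)>
where
    P: Into<Position>,
    I: Iterator<Item = (P, P)>,
{
    intervals
        // When P is already Position, `.into()` is the identity conversion.
        .map(|(left, right)| (left.into(), right.into()))
        .collect()
}
```

With this shape, a caller no longer needs to write 10.0.into() for each endpoint; plain f64 tuples work directly.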

molpopgen · 2024-05-29T15:11:22Z

src/table_collection.rs

+        }
+
+        // build mutation_map
+        let mutation_map = {


I think you could do this using a map/collect idiom instead of allocating the vector internally?
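
As a rough illustration of the map/collect idiom, here is a sketch that assumes the map records, for each mutation row, its new row index if it is kept. That assumption, the keep flags, and the function name are hypothetical; the actual mutation_map in the PR may hold something different.

```rust
/// Hypothetical helper: for each mutation row, the new row index if the row
/// is kept, or `None` if it is dropped. Built with map/collect rather than
/// by pushing into a vector inside a loop.
pub fn build_mutation_map(keep: &[bool]) -> Vec<Option<usize>> {
    let mut next = 0usize;
    keep.iter()
        .map(|&kept| {
            if kept {
                let id = next;
                next += 1;
                Some(id)
            } else {
                None
            }
        })
        .collect()
}
```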

molpopgen · 2024-05-29T15:16:43Z

src/table_collection.rs

+    ///
+    /// Note that no new provenance will be appended.
+    pub fn keep_intervals(
+        &mut self,


It is useful to think through whether this fn should modify in place or consume self. There is no reason to follow the conventions of the Python API here. Consider the case of an error: if the function takes &mut self, then it is possible that the object is left in an invalid internal state, meaning that meaningfully continuing is not possible. When that is the case, one could argue that it is better to consume self.

Further, consuming self allows for better use of method chaining.

(In general, I think too much of the rust API works on &mut self and not self.)

Note that you can have the function accept mut self to consume it. IMO, though, it is a bit more intuitive to have it consume self, and then let mut tables = self inside the function.
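
To make the trade-off concrete, here is a small sketch on a toy type. Tables and keep_interval below are hypothetical stand-ins, not the tskit types; the sketch only shows the "consume self, then let mut tables = self" pattern and how it chains.

```rust
/// Toy stand-in for a table collection (hypothetical; not the tskit type).
#[derive(Debug, PartialEq)]
pub struct Tables {
    pub edges: Vec<(f64, f64)>,
}

impl Tables {
    /// Consumes `self`. On error the value is dropped, so the caller cannot
    /// keep using an object that might be in an invalid internal state.
    pub fn keep_interval(self, left: f64, right: f64) -> Result<Self, String> {
        if !(left < right) {
            return Err(format!("invalid interval [{left}, {right})"));
        }
        // Rebind as mutable inside the function, as suggested above.
        let mut tables = self;
        // Keep only edges that overlap [left, right); no clipping in this toy.
        tables.edges.retain(|&(l, r)| l < right && r > left);
        Ok(tables)
    }
}
```

Because each call returns the value, calls can be chained, for example tables.keep_interval(5.0, 25.0).and_then(|t| t.keep_interval(0.0, 10.0)).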

Great point about an error leaving an invalid internal state! I changed it accordingly.

bguo068 · 2024-05-29T15:35:32Z

Thanks for the comments. I think they all make sense to me. I thought about the same thing regarding the test data but have not had a good idea of how to generate them manually using rust code. If you have some suggestions, it would be great. If not, I will look into the code base of rust/c versions of tskit for the conventions.

molpopgen · 2024-05-29T15:41:33Z

There are two ways to go for building test data:

1. Manually add rows to a table collection, sort it, convert to tree sequence for tests on that object.
2. Randomly generate a bunch of nodes/edges/mutations as if you were doing a forward simulation. Then sort. Convert to tree sequence if necessary.

Both are useful and give different ways of stress-testing the machinery.

To make the tests really useful, one would also want a naive implementation of the algorithm defined in a rust test file. Here, performance doesn't matter and we should be able to assert equality and/or equivalence between the two outputs.
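
A naive reference implementation of this kind might look like the sketch below. It works on a plain Edge struct defined here for illustration (not the tskit row type) and assumes the intervals are sorted, non-overlapping, and half-open; it only handles edges, and makes no attempt to be fast.

```rust
/// Minimal edge row for illustration (not the tskit row type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Edge {
    pub left: f64,
    pub right: f64,
    pub parent: i32,
    pub child: i32,
}

/// Naive reference: clip every edge against every interval and keep the
/// non-empty overlaps. Assumes `intervals` are sorted, non-overlapping and
/// half-open `[start, end)`. Performance is deliberately ignored.
pub fn naive_keep_intervals(edges: &[Edge], intervals: &[(f64, f64)]) -> Vec<Edge> {
    let mut out = Vec::new();
    for &(start, end) in intervals {
        for edge in edges {
            let left = edge.left.max(start);
            let right = edge.right.min(end);
            if left < right {
                out.push(Edge { left, right, ..*edge });
            }
        }
    }
    out
}
```

A test could then sort both outputs the same way and assert that the optimized method and this naive version produce the same edges.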

molpopgen · 2024-05-31T10:22:52Z

src/table_collection.rs

+    /// # Example
+    /// ```rust
+    /// use tskit::TreeSequence;
+    /// let mut tables = TreeSequence::load("./testdata/1.trees")


For the doc tests, you can hide setup code. See here. Doing this lets you only show the reader the function call itself. The hidden code can contain assertions, etc., after it is called.
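
For reference, a sketch of what hidden setup lines look like in a doc comment. The function covered_length and the crate name my_crate are made up for this example. Tilde fences are used inside the doc comment only so that the example nests cleanly here; backtick fences are the usual choice.

```rust
/// Returns the total length covered by `intervals`.
///
/// # Example
/// ~~~
/// # // Lines starting with `# ` are compiled and run as part of the doc
/// # // test, but are hidden in the rendered documentation.
/// # use my_crate::covered_length; // `my_crate` is a placeholder name
/// # let intervals = vec![(10.0, 130.0), (200.0, 250.0)];
/// let total = covered_length(&intervals);
/// # assert_eq!(total, 170.0);
/// ~~~
pub fn covered_length(intervals: &[(f64, f64)]) -> f64 {
    intervals.iter().map(|(left, right)| right - left).sum()
}
```

The reader of the rendered docs sees only the covered_length call, while the setup and the assertion still run under cargo test.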

Very good suggestion!
Currently, I am working on a simple simulation method following the tutorial on the tskit C API webpage. I will make sure I follow this suggestion for the doc tests.

molpopgen · 2024-05-31T11:28:22Z

@bguo068 what happens if one tries to keep intervals for which there are no data? Imagine a table collection with a sequence length of L, yet no edges end at L. Then you keep edges only for the end where there is no data. While the data model allows empty table collections and tree sequences, one could argue that returning None for this case makes things more obvious?

bguo068 · 2024-05-31T13:05:18Z

Yep. I can check the edge table at the end of the keep_intervals method and return Ok(None) when the edge table is empty.
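
A minimal sketch of that return shape, using plain tuples instead of tskit tables. The function name and the String error type are placeholders chosen for this example.

```rust
/// Hypothetical sketch: an empty result is reported as `Ok(None)` rather
/// than as an empty collection. Edges are `(left, right)` pairs.
pub fn keep_edges_in(
    edges: Vec<(f64, f64)>,
    start: f64,
    end: f64,
) -> Result<Option<Vec<(f64, f64)>>, String> {
    if !(start < end) {
        return Err("interval must satisfy start < end".to_string());
    }
    let kept: Vec<(f64, f64)> = edges
        .into_iter()
        .filter_map(|(l, r)| {
            // Clip the edge to the interval and drop it if nothing remains.
            let (l, r) = (l.max(start), r.min(end));
            (l < r).then_some((l, r))
        })
        .collect();
    // Check for emptiness once, at the very end.
    Ok(if kept.is_empty() { None } else { Some(kept) })
}
```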

bguo068 · 2024-06-01T05:01:08Z

@molpopgen Many thanks for your guidance. I have added simulation code to generate trees that can be used to validate the keep_intervals methods. I have one question/issue:

To generate trees in both doc tests and test modules under the tests/ folder, it seems necessary to place the simulation code under the src/ folder and declare it in src/lib.rs. Currently, the simulation code resides in src/test_data.rs.

If this organization is not ideal, could you please advise on a better way to structure the code? Or do we not need to generate trees for the doc tests of keep_intervals?

molpopgen

Thanks. This is moving in the right direction. The primary issue here is that testing code has been added to the public API. We cannot have this -- testing code should be entirely internal.

I have not taken a big picture overview of the PR yet. I will do so this week.

molpopgen · 2024-06-01T15:16:09Z

Cargo.toml

@@ -27,6 +27,8 @@ serde_json = {version = "1.0.114", optional = true}
 bincode = {version = "1.3.1", optional = true}
 tskit-derive = {version = "0.2.0", path = "tskit-derive", optional = true}
 delegate = "0.12.0"
+rand = "0.8.3"


rand is already a dev dependency.

molpopgen · 2024-06-01T15:16:35Z

src/lib.rs

@@ -177,4 +177,5 @@ mod tests {
 }

 // Testing modules
+pub mod test_data;


Use test_fixtures, and do not make testing modules pub -- they are not part of the tskit API.

molpopgen · 2024-06-01T15:17:32Z

src/table_collection.rs

@@ -32,6 +32,7 @@ use crate::TskReturnValue;
 use crate::{EdgeId, NodeId};
 use ll_bindings::tsk_id_t;
 use ll_bindings::tsk_size_t;
+use streaming_iterator::StreamingIterator;


As a style point, we may want to have this use statement local to the functions that need it. If/when we modernize ourselves away from this dependency, it'll be easier to find.
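
A small sketch of a function-local use statement. It uses std::fmt::Write as a stand-in for the streaming_iterator trait, and describe is a made-up function for this example.

```rust
/// Formats intervals as text. The `Write` trait is imported only inside the
/// function that needs it.
pub fn describe(intervals: &[(f64, f64)]) -> String {
    // Function-local import: if the dependency is removed later, only the
    // functions that actually use it need to change.
    use std::fmt::Write;

    let mut out = String::new();
    for (left, right) in intervals {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = write!(out, "[{left}, {right}) ");
    }
    out
}
```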

molpopgen · 2024-06-01T15:19:03Z

src/table_collection.rs

    ///
    /// # Example
    /// ```rust
-    /// use tskit::TreeSequence;
-    /// let mut tables = TreeSequence::load("./testdata/1.trees")
+    /// # use tskit::test_data::simulation::simulate_two_treesequences;


I see what you are going for here. Doc tests only have access to the public api, but testing modules must not be public. Here, you need to be totally self-contained because we won't have a public API to do this.

molpopgen · 2024-06-01T15:19:28Z

src/table_collection.rs

+    /// # let popsplit_time = 10;
+    /// # let seed = 123;
+
+    /// # let (full_trees, _exepected) = simulate_two_treesequences(


Same here -- we don't want to provide a public API for this.

molpopgen · 2024-06-01T15:20:02Z

src/test_data.rs

@@ -0,0 +1,373 @@
+/// mimic the c simulate function in tskit c api document
+/// https://tskit.dev/tskit/docs/stable/c-api.html#basic-forwards-simulator
+pub mod simulation {


Move all of this to test_fixtures.rs so that it is not pub outside of tskit.

molpopgen · 2024-06-01T15:20:31Z

src/trees/treeseq.rs

-    /// use tskit::TreeSequence;
-    /// let mut ts = TreeSequence::load("testdata/1.trees").expect("error loading ts");
-    /// let new_ts = ts.keep_intervals(vec![(10.0.into(), 130.0.into())].into_iter(), true).unwrap();
+    /// # use tskit::test_data::simulation::simulate_two_treesequences;


Ditto here -- this will not be pub.

bguo068 · 2024-06-01T18:44:37Z

@molpopgen I have tried to fix the issues related to test code being exposed as public API. I hope it is better now. If there is anything else to be improved, please comment. Thanks.

molpopgen · 2024-06-01T21:38:19Z

Thanks! I'll take a look later this week.

molpopgen · 2024-06-02T17:53:21Z

The next step on my end will be to pull this branch and play around a bit. I will try to do that in the latter half of the coming week.

molpopgen · 2024-06-02T17:58:39Z

src/test_fixtures.rs

+        for intervals in intervals_lst {
+            // an empty tablecollection is enough here
+            let mut tables = TableCollection::new(100.0).unwrap();
+            tables.build_index().unwrap();


Is this a proper test? By starting with an empty tree sequence, there is no chance of, say, lifting over the same edge 2x? I would have expected the index building and/or the validate method to set an error in some cases?

keep_intervals checks the validity of the intervals before modifying the table collection. I thought adding a few edges and nodes would not change the outcome of the tests. But I can add a few edges and nodes in case there are some tskit errors that I am not aware of.

Got it. Probably no need for an extra test then. Once I'm back to work later this week, I'll take a deeper dive.

I may add a second test that uses proptest to generate random data. Those tests are nice for operations like this.

molpopgen · 2024-06-06T16:17:28Z

src/table_collection.rs

+                .lending_iter()
+                .filter(|mrow| !!((mrow.right <= s) || (mrow.left >= e)));
+
+            while let Some(migration_row) = migration_iter.next() {


The tests do not cover this block. It may be useful to come up with something to cover this?

I left migration table recording out of the simulation because it does not work well with simplification, as we can see from the tskit C API docs for tsk_table_collection_simplify (although the Python keep_intervals code does deal with the migration table):

Note
Migrations are currently not supported by simplify, and an error will be raised if we attempt call simplify on a table collection with greater than zero migrations. See tskit-dev/tskit#20

I can add a note to the Rust keep_intervals docs to reflect this.

That's fair. One could envision a test of manually-generated data that don't require sorting. But the reality is that that block of code is adding data already present in the tables. The only way for there to be an error is if someone loaded a table collection with invalid row data generated by another tool.

Right. I will add a test checking that calling keep_intervals with simplify=true on a tree sequence that has a nonempty migration table returns a tskit error.

Before implementing a manual check, just do the test. tskit-c will, I believe, set an error code for that case, so you can just ? on the simplify call.
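
The idea can be sketched with mock types as follows. MockTables and SimplifyError are invented for this example and do not mirror the tskit API; the mock simplify only imitates the documented behaviour of rejecting nonempty migration tables.

```rust
/// Hypothetical error type standing in for an error code set by tskit-c.
#[derive(Debug, PartialEq)]
pub struct SimplifyError(pub String);

/// Mock tables; only the migration count matters for this sketch.
pub struct MockTables {
    pub num_migrations: usize,
}

impl MockTables {
    /// Imitates the documented behaviour: simplify rejects a table
    /// collection with more than zero migrations.
    pub fn simplify(&mut self) -> Result<(), SimplifyError> {
        if self.num_migrations > 0 {
            return Err(SimplifyError("migrations not supported by simplify".into()));
        }
        Ok(())
    }

    /// No manual migration check here: `?` forwards whatever `simplify`
    /// reports to the caller.
    pub fn keep_intervals(mut self, simplify: bool) -> Result<Self, SimplifyError> {
        if simplify {
            self.simplify()?;
        }
        Ok(self)
    }
}
```

A test then only needs to assert that the call returns an error when the migration table is nonempty and simplification is requested.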

Sorry, I am not sure I understand your suggestion. Did you mean to test the generate_simple_treesequence function in src/test_fixtures.rs, which generates a tree sequence with a nonempty migration table?

I thought I already used ? for the simplify call in the keep_intervals method for TableCollection. Am I missing something?

I think that I misunderstood your comment re: testing. I think that your tests are actually okay except for the issue of adding the migrations after simplification.

molpopgen · 2024-06-07T15:52:17Z

src/test_fixtures.rs

+        tables.full_sort(TableSortOptions::all()).unwrap();
+        tables.simplify(&[child1, child2], sim_opts, false).unwrap();
+
+        // add migration records after simplification to avoid errors when


Hmmm -- this could be the wrong thing to do?? I worry here that what is being added are unsimplified migrations to a simplified table? Further, you are not taking advantage of the output node ID map from simplification to do things like remap the child node IDs.

I think it is simpler and more correct for now to simply do this before simplification and let tskit-c raise an error.

I know it is not the best way to get some migration records into a tree sequence, but I just want to add something to the migration table to trigger an error when simplify is called within the implementation of the keep_intervals method.

If I add the migration records before simplification and let the tskit C API raise an error, I won't be able to get a tree sequence with a nonempty migration table, which is later used to trigger the error when calling keep_intervals.

Maybe you want me to remove the whole test_keep_intervals_nonempty_migration_table test?

There are two problems that I see:

The migration spans are incorrect and we cannot validate/check that.

The node ids for the migrations are incorrect.

So if we want to keep this test (and I think we probably do), then you should at least remap the node ids to be correct using the output of simplify.

Got it. I just updated the code. Hopefully it is better now.

molpopgen · 2024-06-11T21:52:47Z

src/test_fixtures.rs


        // add migration records after simplification to avoid errors when
        // simplifying a treesequence that contains a nonempty migration table
        if add_migration_records {
            let pop_anc = tables.add_population().unwrap();
            let pop_1 = tables.add_population().unwrap();
            let pop_2 = tables.add_population().unwrap();
+            // get new ids after simplifcation
+            let child1 = id_map[0];


It would be preferable here to replace 0 with child1.as_usize(). Same for [1] in the following line.

Thanks for catching that! I was thinking that the returned id map has the same length as samples: &[NodeId]. It actually has the length of self.nodes().num_rows() before the table collection or tree sequence is modified.

Right -- the behavior allows you to update node-id-based data structures that are not managed by a table collection. This is very useful in practice.
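
To illustrate why the map is indexed by the old node id, here is a toy sketch. toy_simplify is a made-up function that keeps only the sample nodes (real simplification also retains the ancestors it needs), and -1 plays the role of the null id.

```rust
/// Toy stand-in for simplification: keeps only `samples` and returns an id
/// map with one entry per ORIGINAL node. Entry `i` is the new id of old
/// node `i`, or `-1` if that node was removed.
pub fn toy_simplify(num_nodes: usize, samples: &[usize]) -> Vec<i32> {
    let mut id_map = vec![-1; num_nodes];
    for (new_id, &old_id) in samples.iter().enumerate() {
        id_map[old_id] = new_id as i32;
    }
    id_map
}
```

With five nodes and samples [3, 4], the new id of node 3 is id_map[3], not id_map[0]. Indexing by position in the sample list only happens to work when the samples are nodes 0 and 1.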

I'm about to go camping and then on holiday. More when I return.

Have fun!🤩

molpopgen · 2024-07-02T17:07:57Z

@bguo068 I think that this is almost ready. Can you please do the following:

1. Rebase your PR against the current main branch.
2. Squash your commits into a single commit using conventional commit syntax. For example: feat: added keep intervals for tables and tree sequences. You can see how other commits look via git log. We use this syntax to automate changelog generation.

bguo068 · 2024-07-02T21:12:37Z

@molpopgen Thanks for the instructions. I've rebased my PR against the current main branch and squashed the commits into a single commit using the conventional commit syntax. Please let me know if any further adjustments are needed.

molpopgen · 2024-07-02T23:10:15Z

Thanks! I'll take one last look to check for any "red flags", but I do think that this can be merged soon.

molpopgen · 2024-07-03T01:58:03Z

I think there is one thing missing:

If the tables are simplified, we don't have a means of returning the node id map.
Without this map, some information important to client code can be lost.

I don't think that this is a problem for this PR, though. I'll try to work through the semantics in a later PR and before the next stable release.

molpopgen · 2024-07-03T14:24:36Z

@bguo068 I think that the solution to the simplification issue will be to separate trimming to intervals from simplification. Given that rust is a low level language, functions managing multiple operations like "retain AND simplify IF a user so desires" may not be the way to go.

I'll puzzle this out in a downstream PR like I said.
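
One possible shape for such a split, sketched on a toy type. ToyTables and both method names are hypothetical, and drop_unreferenced_nodes is only a simple stand-in for simplification; the point is that the second step can hand the node id map back to the caller.

```rust
/// Toy tables (hypothetical): edges are (left, right, parent, child).
#[derive(Debug)]
pub struct ToyTables {
    pub num_nodes: usize,
    pub edges: Vec<(f64, f64, usize, usize)>,
}

impl ToyTables {
    /// Step 1: trim to one interval. No simplification happens here.
    pub fn keep_interval(mut self, start: f64, end: f64) -> Self {
        self.edges = self
            .edges
            .iter()
            .map(|&(l, r, p, c)| (l.max(start), r.min(end), p, c))
            .filter(|&(l, r, _, _)| l < r)
            .collect();
        self
    }

    /// Step 2 (stand-in for simplify): drop nodes that no edge references
    /// and return the id map, one entry per original node, `-1` if removed.
    pub fn drop_unreferenced_nodes(&mut self) -> Vec<i32> {
        let mut used = vec![false; self.num_nodes];
        for &(_, _, p, c) in &self.edges {
            used[p] = true;
            used[c] = true;
        }
        let mut next = 0;
        let id_map: Vec<i32> = used
            .iter()
            .map(|&u| if u { next += 1; next - 1 } else { -1 })
            .collect();
        // Rewrite edge endpoints using the new ids.
        for edge in &mut self.edges {
            edge.2 = id_map[edge.2] as usize;
            edge.3 = id_map[edge.3] as usize;
        }
        self.num_nodes = next as usize;
        id_map
    }
}
```

Keeping the two steps separate means the caller decides whether to run the second one, and the id map is never silently discarded.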

molpopgen · 2024-07-03T14:28:40Z

Going to merge this now. Thanks @bguo068 !!

bguo068 · 2024-07-03T18:45:25Z

@molpopgen Thank you for all your guidance and support in making this happen. I look forward to contributing more features to this project in the future.

bguo068 force-pushed the main branch from 31d0ced to 9c7bed3 on May 29, 2024 04:53

bguo068 marked this pull request as ready for review May 29, 2024 05:14

bguo068 force-pushed the main branch from 9c7bed3 to 7755a02 on May 29, 2024 05:16

molpopgen requested changes May 29, 2024

molpopgen reviewed May 31, 2024

bguo068 force-pushed the main branch 2 times, most recently from e3c7009 to 2e5d9c5 on June 1, 2024 04:49

bguo068 requested a review from molpopgen June 1, 2024 05:16

molpopgen requested changes Jun 1, 2024

bguo068 force-pushed the main branch from 2b8a6de to ddfa98d on June 1, 2024 18:36

bguo068 force-pushed the main branch from ddfa98d to 64d33e6 on June 1, 2024 18:46

molpopgen reviewed Jun 1, 2024

molpopgen reviewed Jun 2, 2024

molpopgen reviewed Jun 2, 2024

bguo068 force-pushed the main branch from eec380f to 05c2e31 on June 3, 2024 04:33

molpopgen requested changes Jun 6, 2024

bguo068 force-pushed the main branch 3 times, most recently from e93ff0a to 63fff51 on June 7, 2024 02:21

molpopgen reviewed Jun 7, 2024

molpopgen reviewed Jun 11, 2024

feat: added keep intervals for tables and tree sequences

e30edc8

bguo068 force-pushed the main branch from 46c1325 to e30edc8 on July 2, 2024 20:58

molpopgen approved these changes Jul 2, 2024

molpopgen merged commit cb37378 into tskit-dev:main Jul 3, 2024
11 checks passed

This was referenced Jul 3, 2024

Implement truncation of tree sequences to specified genome intervals #615

Closed

Separate keep_intervals from simplify #647

Merged
