-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement keep_intervals method #635
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I've made some review contents based on a first pass. The most important thing is the testing. The tests need to be totally self-contained in rust.
testdata/gen_trees.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a different way to generate the test data. We cannot have dependencies on Python, and we don't want trees files committed to the repository. To make test data, use low-level operations in rust functions that are #[cfg(test)]
.
Beyond keeping developers sane, we also don't want to run into situations where the Python packages using tskit are using a different version of the C library than tskit-rust. If that were to happen, it becomes possible that we have ABI problems when trying to load the files.
src/table_collection.rs
Outdated
/// Note that no new provenance will be appended. | ||
pub fn keep_intervals( | ||
&mut self, | ||
intervals: impl Iterator<Item = (Position, Position)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the type of intervals
to impl Iterator<Item = (P, P)>
where P: Into Position
. would improve the ergonomics. If P
were Position
, the .into()
would optimize out.
src/table_collection.rs
Outdated
} | ||
|
||
// build mutation_map | ||
let mutation_map = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could do this using a map/collect
idiom instead of allocating the vector internally?
src/table_collection.rs
Outdated
/// | ||
/// Note that no new provenance will be appended. | ||
pub fn keep_intervals( | ||
&mut self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is useful to think through if this fn should modify in place or consume self. There is no reason to follow the conventions of the Python API here. Consider the case of an error: if the function takes & mut self
, then it is possible that the object is left in an invalid internal state, meaning that meaningfully continuing is not possible. When that is the case, one could argue that it is better to consume self
.
Further, consuming self
allows for better use of method chaining.
(In general, I think too much of the rust API works on &mut self
and not self
.)
Note that you can have the function accept mut self
to consume it. IMO, though, it is a bit more intuitive to have it consume self, and then let mut tables = self
inside the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point on error causing invalide interval state! I change it accordingly
Thanks for the comments. I think they all make sense to me. I thought about the same thing regarding the test data but have not had a good idea of how to generate them manually using rust code. If you have some suggestions, it would be great. If not, I will look into the code base of rust/c versions of tskit for the conventions. |
There are two ways to go for building test data:
Both are useful and give different ways of stress-testing the machinery. To make the tests really useful, one would also want a naive implementation of the algorithm defined in a rust test file. Here, performance doesn't matter and we should be able to assert equality and/or equivalence between the two outputs. |
src/table_collection.rs
Outdated
/// # Example | ||
/// ```rust | ||
/// use tskit::TreeSequence; | ||
/// let mut tables = TreeSequence::load("./testdata/1.trees") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the doc tests, you can hide setup code. See here. Doing this lets you only show the reader the function call itself. The hidden code can contain assertions, etc., after it is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good suggestion!
Currently, I am working on a simple simulation method following the tutorial on tskit c api webpage. I will make sure I follow this suggestion for doc test.
@bguo068 what happens if one tries to keep intervals for which there are no data? Imagine a table collection with a sequence length of L, yet no edges end at L. Then you keep edges only for the end where there is no data. While the data model allows empty table collections and tree sequences, one could argue that returning None for this case makes things more obvious? |
Yep. I can check the edge table at the end of keep_intervals method and return Ok(None) when edge table is empty. |
e3c7009
to
2e5d9c5
Compare
@molpopgen Many thanks for your guidance. I have added simulation code to generate trees that can be used to validate the To generate trees in both doc tests and test modules under the If this organization is not ideal, could you please advise on a better way to structure the code? Or we do not need generate trees for doc tests of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This is moving in the right direction. The primary issue here is that testing code has been added to the public API. We cannot have this -- testing code should be entirely internal.
I have not taken a big picture overview of the PR yet. I will do so this week.
Cargo.toml
Outdated
@@ -27,6 +27,8 @@ serde_json = {version = "1.0.114", optional = true} | |||
bincode = {version = "1.3.1", optional = true} | |||
tskit-derive = {version = "0.2.0", path = "tskit-derive", optional = true} | |||
delegate = "0.12.0" | |||
rand = "0.8.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rand is already a dev dependency.
src/lib.rs
Outdated
@@ -177,4 +177,5 @@ mod tests { | |||
} | |||
|
|||
// Testing modules | |||
pub mod test_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use test_fixtures and do not make testing modules pub -- they are not part of the tskit API.
src/table_collection.rs
Outdated
@@ -32,6 +32,7 @@ use crate::TskReturnValue; | |||
use crate::{EdgeId, NodeId}; | |||
use ll_bindings::tsk_id_t; | |||
use ll_bindings::tsk_size_t; | |||
use streaming_iterator::StreamingIterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a style point, we may want to have this use statement local to the functions that need it. If/when we modernize ourselves away from this dependency, it'll be easier to find.
src/table_collection.rs
Outdated
/// | ||
/// # Example | ||
/// ```rust | ||
/// use tskit::TreeSequence; | ||
/// let mut tables = TreeSequence::load("./testdata/1.trees") | ||
/// # use tskit::test_data::simulation::simulate_two_treesequences; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are going for here. Doc tests only have access to the public api, but testing modules must not be public. Here, you need to be totally self-contained because we won't have a public API to do this.
src/table_collection.rs
Outdated
/// # let popsplit_time = 10; | ||
/// # let seed = 123; | ||
|
||
/// # let (full_trees, _exepected) = simulate_two_treesequences( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here -- we don't want to provide a public API for this.
src/test_data.rs
Outdated
@@ -0,0 +1,373 @@ | |||
/// mimic the c simulate function in tskit c api document | |||
/// https://tskit.dev/tskit/docs/stable/c-api.html#basic-forwards-simulator | |||
pub mod simulation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move all of this to test_fixtures.rs
so that it is not pub
outside of tskit.
src/trees/treeseq.rs
Outdated
/// use tskit::TreeSequence; | ||
/// let mut ts = TreeSequence::load("testdata/1.trees").expect("error loading ts"); | ||
/// let new_ts = ts.keep_intervals(vec![(10.0.into(), 130.0.into())].into_iter(), true).unwrap(); | ||
/// # use tskit::test_data::simulation::simulate_two_treesequences; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here -- this will not be pub.
@molpopgen I have tried to fix issues related to test code being public as api. Hope now its better. If there is anything else to be improved, please comment. Thanks. |
Thanks! I'll take a look later this week. |
The next step on my end will be to pull this branch and play around a bit. I will try to do that in the latter half of the coming week. |
src/test_fixtures.rs
Outdated
for intervals in intervals_lst { | ||
// an empty tablecollection is enough here | ||
let mut tables = TableCollection::new(100.0).unwrap(); | ||
tables.build_index().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a proper test? By starting with an empty tree sequence, there is no chance of, say, lifting over the same edge 2x? I would have expected the index building and/or the validate method to set an error in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep_intervals
checks the validity of the intervals before it modifying table collection. I thought adding a few edges and nodes would not change the outcome of the tests. But I can add a few edges and nodes in case of some tskit errors that I am not aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Probably no need for an extra test then. Once I'm back to work later this week, I'll take a deeper dive.
I may add a second test that uses proptest to generate random data. Those tests are nice for operations like this.
.lending_iter() | ||
.filter(|mrow| !!((mrow.right <= s) || (mrow.left >= e))); | ||
|
||
while let Some(migration_row) = migration_iter.next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests do not cover this block. It may be useful to come up with something to cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left migration table recording out of simulation as it does not work well with simplification as we can see from tskit C api docs for tsk_table_collection_simplify
(although the keep_intervals
python code deal with migration table):
Note
Migrations are currently not supported by simplify, and an error will be raised if we attempt call simplify on a table collection with greater than zero migrations. See tskit-dev/tskit#20
I can add a note to the Rust keep_intervals
docs to reflect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. One could envision a test of manually-generated data that don't require sorting. But the reality is that that block of code is adding data already present in the tables. The only way for there to be an error is if someone loaded a table collection with invalid row data generated by another tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I will make a test that if some one try to call keep_intervals on treeseq that has nonempty migration and set simplify=true
in the argument, it should return a tskit error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before implementing a manual check, just do the test. tskit-c will, I believe, set an error code for that case, so you can just ?
on the simplify call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before implementing a manual check, just do the test. tskit-c will, I believe, set an error code for that case, so you can just
?
on the simplify call.
Sorry I am not sure I understand your suggestion. Did you mean to do test on the method that generates a treesequence with non-empty migration table in src/test_fixtures.rs
generate_simple_treesequence
function ?
I thought I already used ?
for the simplfy
call in the keep_intervals
method for TableCollection
. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that I misunderstood your comment re: testing. I think that your tests are actually okay except for the issue of adding the migrations after simplification.
e93ff0a
to
63fff51
Compare
tables.full_sort(TableSortOptions::all()).unwrap(); | ||
tables.simplify(&[child1, child2], sim_opts, false).unwrap(); | ||
|
||
// add migration records after simplification to avoid errors when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm -- this could be the wrong thing to do?? I worry here that what is being added are unsimplified migrations to a simplified table? Further, you are not taking advantage of the output node ID map from simplification to do things like remap the child node IDs.
I think it is simpler and more correct for now to simply do this before simplification and let tskit-c raise an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm -- this could be the wrong thing to do?? I worry here that what is being added are unsimplified migrations to a simplified table? Further, you are not taking advantage of the output node ID map from simplification to do things like remap the child node IDs.
I think it is simpler and more correct for now to simply do this before simplification and let tskit-c raise an error.
I know it is not the best way to get some migration records in a treesequence but I just want to add something on the migration table to trigger error for calling simplify
within the implementation of the keep_interval
method.
If I add the migrtation records before simplifcation and let tskit c api raise an error, I wont be able to get a treesequence that has non-empty migration table that is later used to trigger error when calling keep_intervals
.
Maybe you want me to remove the whole test_keep_intervals_nonempty_migration_table
tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two problems that I see:
- The migration spans are incorrect and we cannot validate/check that.
- The node ids for the migrations are incorrect.
So if we want to keep this test (and I think we probably do), then you should at least remap the node ids to be correct using the output off simplify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I just updated the code. Hopefully it is better now.
src/test_fixtures.rs
Outdated
|
||
// add migration records after simplification to avoid errors when | ||
// simplifying a treesequence that contains a nonempty migration table | ||
if add_migration_records { | ||
let pop_anc = tables.add_population().unwrap(); | ||
let pop_1 = tables.add_population().unwrap(); | ||
let pop_2 = tables.add_population().unwrap(); | ||
// get new ids after simplifcation | ||
let child1 = id_map[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be preferable here to replace 0
with child1.as_usize()
. Same for [1]
in the following line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that! I was thinking that the return idmap
has the same length as samples: &[NodeId]
. It actually has the length of self.nodes().num_rows()
before the tablecollection or treesequence is modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right -- the behavior allows you to update node-id-based data structures that are not managed by a table collection. This is very useful in practice.
I'm about to go camping and then on holiday. More when I return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have fun!🤩
@bguo068 I think that this is almost ready. Can you please do the following:
|
@molpopgen Thanks for the instructions. I've rebased my PR against the current main branch and squashed the commits into a single commit using the conventional commit syntax. Please let me know if any further adjustments are needed. |
Thanks! I'll take one last look to check for any "red flags", but I do think that this can be merged soon. |
I think there is one thing missing:
I don't think that this is a problem for this PR, though. I'll try to work through the semantics in a later PR and before the next stable release. |
@bguo068 I think that the solution to the simplification issue will be to separate trimming to intervals from simplification. Given that rust is a low level language, functions managing multiple operations like "retain AND simplify IF a user so desires" may not be the way to go. I'll puzzle this out in a downstream PR like I said. |
Going to merge this now. Thanks @bguo068 !! |
@molpopgen Thank you for all your guidance and support in making this happen. I look forward to contributing more features to this project in the future. |
This PR is aimed to address issue #615. A
keep_intervals
method is implemented forTableCollection
and bothTreeSequence
structs (the latter is a wrapper of the former). The implementation is done intskit::
rather thantskit::sys::
as the existing bindings seem to be adequate for this.There are related methods, such as
TreeSequence.trim()
, that could be nice to add in the future, if this type of PR is interested.Thanks!