Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sim_ancestry with initial state can produce invalid ts when mutations exist above the roots #2319

Open
ShyamieG opened this issue Aug 28, 2024 · 5 comments

Comments

@ShyamieG
Copy link

Hi all,

I'm running into a "TSK_ERR_MUTATION_TIME_OLDER_THAN_PARENT_NODE" error when trying to recapitate a truncated tree sequence. I've attached some code and a tree sequence to walk through the issue.

The original tree sequence can be recapitated without a problem, but I run into the error when I try to truncate my tree sequence first using the 'delete_older' method:

# truncate, then recapitate
max_age = 20000
ts_tables = tree_seq.dump_tables()
ts_tables.delete_older(max_age)
trunc_ts = ts_tables.tree_sequence()

msprime.sim_ancestry(initial_state=trunc_ts, recombination_rate = 1e-9, demography = demography, random_seed = 1234)

the above code gave me the following error:
Traceback (most recent call last):
File "", line 1, in
File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 1226, in sim_ancestry
return _wrap_replicates(
File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 660, in _wrap_replicates
deque = collections.deque(iterator, maxlen=1)
File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 1518, in run_replicates
ts = tables.tree_sequence()
File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/tables.py", line 3341, in tree_sequence
return tskit.TreeSequence.load_tables(self)
File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/trees.py", line 4110, in load_tables
ts.load_tables(tables._ll_tables, build_indexes=build_indexes)
_tskit.LibraryError: A mutation's time must be < the parent node of the edge on which it occurs, or be marked as 'unknown'. (TSK_ERR_MUTATION_TIME_OLDER_THAN_PARENT_NODE)

I found this strange because I'm not simulating any new mutations during recapitation. I also checked for cases where a mutation has an older time than the parent node of its associated edge and found none:

# check for bad mutations
for mutation in trunc_ts.mutations():
	mut_time = mutation.time
	parent_node_time = trunc_ts.node(trunc_ts.edge(mutation.edge).parent).time
	if mut_time >= parent_node_time:
		print(mutation)

I went ahead and marked all my mutation times as unknown, as suggested by the error message. I don't really need mutation times for my application, so that is a viable workaround for me. However, I feel like this should not be happening and might indicate a bug. Any ideas what might be going on?

Thanks!

@ShyamieG
Copy link
Author

ShyamieG commented Aug 28, 2024

Sorry, I just realized that there are in fact two mutations that violate the rule. It seems they ended up on edges that no longer exist after I truncate my tree. I can simply remove them and it works.

Here is the function I wrote:

def truncate_tree(tree_seq, max_age):
	ts_tables = tree_seq.dump_tables()
	ts_tables.delete_older(max_age) # deletes mutations and edges
	# delete nodes that are too old
	too_old_nodes = set(np.where(tree_seq.nodes_time > max_age)[0])
	ts_tables.nodes.clear()
	for node in tree_seq.nodes():
		if node.id not in too_old_nodes:
			ts_tables.nodes.add_row(flags = node.flags, \
									time = node.time, \
									population = node.population, \
									individual = node.individual, \
									metadata = node.metadata)
	trunc_ts = ts_tables.tree_sequence()
	# delete mutations whose edge no longer exists
	ts_tables = trunc_ts.dump_tables()
	ts_tables.mutations.clear()
	for mutation in trunc_ts.mutations():
		if mutation.edge != -1:
			ts_tables.mutations.add_row(site = mutation.site, \
										node = mutation.node, \
										derived_state = mutation.derived_state, \
										parent = mutation.parent, \
										metadata = mutation.metadata)
	new_tree_seq = ts_tables.tree_sequence()
	return new_tree_seq

@bhaller
Copy link

bhaller commented Aug 28, 2024

Hmm, should this be closed? Having a mutation positioned on an edge that no longer exists sounds like the truncation operation has left the user with an invalid tree sequence, no? The mutation should either be removed, or re-positioned onto an edge that does still exist, no? I'm gonna reopen this for now, just to make sure this gets seen – although perhaps I'm wrong and there is no bug here.

@bhaller bhaller reopened this Aug 28, 2024
@petrelharp
Copy link
Contributor

Gee, thanks, @ShyamieG - this is an issue. I think that either

  1. this is an issue with msprime.sim_ancestry(initial_state=...), in that it needs to check for mutations above the roots and either deal with them properly (either put them on the right branches or error?), or
  2. we should decide that this is a User Beware situation and maybe add documentation.

My inclination is (1), and in fact it'd be a pretty easy thing to fix up (ie to put the mutations where they should go?) but I haven't thought about it hard. I'm going to transfer this to msprime.

@petrelharp petrelharp transferred this issue from tskit-dev/pyslim Sep 3, 2024
@petrelharp petrelharp changed the title mutation time error when recapitating sim_ancestry with initial state can produce invalid ts when mutations exist above the roots Sep 3, 2024
@petrelharp
Copy link
Contributor

petrelharp commented Sep 4, 2024

Tangentially, we have the code to do the needed operation over in tskit-dev/tskit#2938:

static int
tsk_treeseq_slide_mutation_nodes_up(
    const tsk_treeseq_t *self, tsk_mutation_table_t *mutations)
{
    int ret = 0;
    double t;
    tsk_id_t c, p, next_mut;
    const tsk_table_collection_t *tables = self->tables;
    const tsk_size_t num_nodes = tables->nodes.num_rows;
    double *sites_position = tables->sites.position;
    double *nodes_time = tables->nodes.time;
    tsk_tree_t tree;

    ret = tsk_tree_init(&tree, self, TSK_NO_SAMPLE_COUNTS);
    if (ret != 0) {
        goto out;
    }

    next_mut = 0;
    for (ret = tsk_tree_first(&tree); ret == TSK_TREE_OK; ret = tsk_tree_next(&tree)) {
        while (next_mut < (tsk_id_t) mutations->num_rows
               && sites_position[mutations->site[next_mut]] < tree.interval.right) {
            t = mutations->time[next_mut];
            if (tsk_is_unknown_time(t)) {
                ret = TSK_ERR_DISALLOWED_UNKNOWN_MUTATION_TIME;
                goto out;
            }
            c = mutations->node[next_mut];
            tsk_bug_assert(c < (tsk_id_t) num_nodes);
            p = tree.parent[c];
            while (p != TSK_NULL && nodes_time[p] <= t) {
                c = p;
                p = tree.parent[c];
            }
            tsk_bug_assert(nodes_time[c] <= t);
            mutations->node[next_mut] = c;
            next_mut++;
        }
    }
    if (ret != 0) {
        goto out;
    }

out:
    tsk_tree_free(&tree);

    return ret;
}

and python code:


def _slide_mutation_nodes_up(ts, mutations):
    # adjusts mutations' nodes to place each mutation on the correct edge given
    # their time; requires mutation times be nonmissing and the mutation times
    # be >= their nodes' times.

    assert np.all(~tskit.is_unknown_time(mutations.time)), "times must be known"
    new_nodes = mutations.node.copy()

    mut = 0
    for tree in ts.trees():
        _, right = tree.interval
        while (
            mut < mutations.num_rows and ts.sites_position[mutations.site[mut]] < right
        ):
            t = mutations.time[mut]
            c = mutations.node[mut]
            p = tree.parent(c)
            assert ts.nodes_time[c] <= t
            while p != -1 and ts.nodes_time[p] <= t:
                c = p
                p = tree.parent(c)
            assert ts.nodes_time[c] <= t
            if p != -1:
                assert t < ts.nodes_time[p]
            new_nodes[mut] = c
            mut += 1

    # in C the node column can be edited in place
    new_mutations = mutations.copy()
    new_mutations.clear()
    for mut, n in zip(mutations, new_nodes):
        new_mutations.append(mut.replace(node=n))

    return new_mutations

@petrelharp
Copy link
Contributor

petrelharp commented Oct 2, 2024

Someone else has hit this, FWIW: tskit-dev/pyslim#352 - tldr; the input trees were simplified (they shouldn't have been, or should have had keep_input_roots=True).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants