Finding span of a node across trees (and average arity) #2718
-
I would like to find the average arity of each node over the entire tree sequence. I think I can do this by simply adding up the span of each edge with that node as a parent, and dividing by the span over which that node is present in any tree (the "node span"). Because edges are sorted by parent ID, we can do the first part pretty efficiently in numpy (this takes 12.4 ms on the huge Covid ARG we have) import msprime
import numpy as np
ts = msprime.simulate(10, recombination_rate=1, random_seed=1)
span_sums = np.zeros(ts.num_nodes)
# Find the edge indices where parents change
i = np.insert(np.nonzero(np.diff(ts.edges_parent))[0] + 1, 0, 0)
span_sums[ts.edges_parent[i]] = np.add.reduceat(ts.edges_right - ts.edges_left, i) But now I need to divide by the span over which the node is present in any tree. I can't think of an easy way to get this: I guess it might involve counting over edge diffs or using an interval library on the edges. It would be helpful to have a recipe to do this, as it's a fairly fundamental quantity, I think. |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 7 replies
-
This is (obviously) the slow way to do it. I'm looking for a more efficient incremental solution node_spans = np.zeros(ts.num_nodes)
for tree in ts.trees():
for u in tree.nodes():
node_spans[u] += tree.span |
Beta Was this translation helpful? Give feedback.
-
I think this does it: import msprime
import numpy as np
def span_definition(ts):
node_spans = np.zeros(ts.num_nodes)
for tree in ts.trees():
for u in tree.nodes():
node_spans[u] += tree.span
return node_spans
def span(ts):
num_children = np.zeros(ts.num_nodes, dtype=np.int32)
span_start = np.zeros(ts.num_nodes)
node_span = np.zeros(ts.num_nodes)
for interval, edges_out, edges_in in ts.edge_diffs(include_terminal=True):
for edge in edges_out:
num_children[edge.parent] -= 1
if num_children[edge.parent] == 0:
node_span[edge.parent] += interval.left - span_start[edge.parent]
for edge in edges_in:
if num_children[edge.parent] == 0:
span_start[edge.parent] = interval.left
num_children[edge.parent] += 1
# Set the sample spans afterwards, so internal samples are handled correctly
node_span[ts.samples()] = ts.sequence_length
return node_span
ts = msprime.simulate(10, recombination_rate=1, random_seed=1)
span1 = span_definition(ts)
span2 = span(ts)
np.testing.assert_array_almost_equal(span1, span2) Not tested though! |
Beta Was this translation helpful? Give feedback.
I think this does it: