Finding span of a node across trees (and average arity) #2718

hyanwong · 2023-03-01T09:15:36Z

hyanwong
Mar 1, 2023
Maintainer

I would like to find the average arity of each node over the entire tree sequence. I think I can do this by adding up the span of each edge with that node as a parent, and dividing by the span over which that node is present in any tree (the "node span"). Because edges are sorted by parent ID, we can do the first part pretty efficiently in numpy (this takes 12.4 ms on the huge Covid ARG we have):

import msprime
import numpy as np

ts = msprime.simulate(10, recombination_rate=1, random_seed=1)

span_sums = np.zeros(ts.num_nodes)
# Find the edge indices where parents change
i = np.insert(np.nonzero(np.diff(ts.edges_parent))[0] + 1, 0, 0)
span_sums[ts.edges_parent[i]] = np.add.reduceat(ts.edges_right - ts.edges_left, i)

But now I need to divide by the span over which the node is present in any tree. I can't think of an easy way to get this: I guess it might involve counting over edge diffs or using an interval library on the edges. It would be helpful to have a recipe to do this, as it's a fairly fundamental quantity, I think.
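One way to read the "interval library on the edges" idea is to take, for each parent, the union of its edge intervals and sum the lengths. The following is only a minimal sketch under that assumption: the function name `parent_span_from_edges` and the toy edge arrays are hypothetical, it works on plain NumPy arrays rather than a tree sequence, and it only counts positions where a node has at least one child, so leaf samples would need to be handled separately.

```python
import numpy as np

def parent_span_from_edges(edges_parent, edges_left, edges_right, num_nodes):
    """Length of the union of edge intervals for each parent node (sketch)."""
    span = np.zeros(num_nodes)
    # Sort by parent, then by left coordinate within each parent
    order = np.lexsort((edges_left, edges_parent))
    cur_parent, cur_left, cur_right = -1, 0.0, 0.0
    for p, l, r in zip(edges_parent[order], edges_left[order], edges_right[order]):
        if p != cur_parent or l > cur_right:
            # New parent, or a gap: close the merged interval built so far
            if cur_parent != -1:
                span[cur_parent] += cur_right - cur_left
            cur_parent, cur_left, cur_right = p, l, r
        else:
            # Overlapping or abutting interval for the same parent: extend it
            cur_right = max(cur_right, r)
    if cur_parent != -1:
        span[cur_parent] += cur_right - cur_left
    return span

# Hypothetical toy edge table: node 4 has children over all of [0, 10),
# node 5 has a child over [2, 5) and [8, 10) only.
edges_parent = np.array([4, 4, 4, 5, 5])
edges_left = np.array([0.0, 0.0, 6.0, 2.0, 8.0])
edges_right = np.array([6.0, 10.0, 10.0, 5.0, 10.0])
toy_span = parent_span_from_edges(edges_parent, edges_left, edges_right, 6)
print(toy_span)
```

With a real tree sequence the inputs would presumably be `ts.edges_parent`, `ts.edges_left`, `ts.edges_right` and `ts.num_nodes`. The Python-level loop over edges means this is unlikely to reach the millisecond range on very large inputs.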

Answered by jeromekelleher

Mar 2, 2023

hyanwong · 2023-03-01T09:17:44Z

hyanwong
Mar 1, 2023
Maintainer Author

This is (obviously) the slow way to do it. I'm looking for a more efficient incremental solution:

node_spans = np.zeros(ts.num_nodes)
for tree in ts.trees():
    for u in tree.nodes():
        node_spans[u] += tree.span

2 replies

hyanwong Mar 1, 2023
Maintainer Author

(and the original average arity calc can be done inefficiently like this:

arity = np.zeros(ts.num_nodes)
for tree in ts.trees():
    for u in tree.nodes():
        arity[u] += tree.num_children(u) * tree.span
# Now divide by total node span

This takes over 1/2 hour to run on my laptop on the 650K sample "long" Covid ARG, so an incremental solution would definitely be useful)

hyanwong Mar 1, 2023
Maintainer Author

Having said which, using the tree arrays is much faster. This takes 1.5 mins on the "long" Covid ARG. It's still not in the millisecond range, though.

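import numpy as np
import tqdm

# `ts` is the tree sequence (here, the "long" Covid ARG)
arity = np.zeros(ts.num_nodes)
node_spans = np.zeros(ts.num_nodes)
for tree in tqdm.tqdm(ts.trees()):
    po = tree.postorder()
    arity[po] += tree.num_children_array[po] * tree.span
    node_spans[po] += tree.span
# Nodes that appear in no tree have zero span, so this gives NaN for them
arity /= node_spans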

jeromekelleher · 2023-03-02T11:01:36Z

jeromekelleher
Mar 2, 2023
Maintainer

I think this does it:

import msprime
import numpy as np

def span_definition(ts):
    node_spans = np.zeros(ts.num_nodes)
    for tree in ts.trees():
        for u in tree.nodes():
            node_spans[u] += tree.span
    return node_spans

def span(ts):
    num_children = np.zeros(ts.num_nodes, dtype=np.int32)
    span_start = np.zeros(ts.num_nodes)
    node_span = np.zeros(ts.num_nodes)
    
    for interval, edges_out, edges_in in ts.edge_diffs(include_terminal=True):
        for edge in edges_out:
            num_children[edge.parent] -= 1
            if num_children[edge.parent] == 0:
                node_span[edge.parent] += interval.left - span_start[edge.parent]
        for edge in edges_in:
            if num_children[edge.parent] == 0:
                span_start[edge.parent] = interval.left
            num_children[edge.parent] += 1
    # Set the sample spans afterwards, so internal samples are handled correctly
    node_span[ts.samples()] = ts.sequence_length
    return node_span

ts = msprime.simulate(10, recombination_rate=1, random_seed=1)

span1 = span_definition(ts)
span2 = span(ts)
np.testing.assert_array_almost_equal(span1, span2)

Not tested though!
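For the average arity asked about in the original post, the node spans can be combined with the per-parent sum of edge spans. This is a rough sketch rather than a tested recipe: the function name `average_arity` and the toy arrays are hypothetical, `np.bincount` is swapped in for the `np.add.reduceat` approach from the question, and nodes with zero span are assigned an arity of 0 by assumption to avoid dividing by zero.

```python
import numpy as np

def average_arity(edges_parent, edges_left, edges_right, node_span):
    """Span-weighted mean number of children per node (sketch)."""
    num_nodes = len(node_span)
    # Total span of all edges below each parent
    child_span = np.bincount(
        edges_parent, weights=edges_right - edges_left, minlength=num_nodes
    )
    arity = np.zeros(num_nodes)
    # Divide only where the node is present somewhere; other entries stay 0
    np.divide(child_span, node_span, out=arity, where=node_span > 0)
    return arity

# Hypothetical toy data: nodes 0-3 are leaves present over [0, 10),
# node 4 always has two children, node 5 has one child over a span of 5,
# and node 6 never appears in any tree.
edges_parent = np.array([4, 4, 4, 5, 5])
edges_left = np.array([0.0, 0.0, 6.0, 2.0, 8.0])
edges_right = np.array([6.0, 10.0, 10.0, 5.0, 10.0])
node_span = np.array([10.0, 10.0, 10.0, 10.0, 10.0, 5.0, 0.0])
toy_arity = average_arity(edges_parent, edges_left, edges_right, node_span)
print(toy_arity)
```

With a tree sequence, the call would presumably look like `average_arity(ts.edges_parent, ts.edges_left, ts.edges_right, span(ts))`, using the `span` function above.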

5 replies

hyanwong Mar 2, 2023
Maintainer Author

Thanks. That's really useful. I'll give it a whirl. I wonder if somehow we can have a builtin function to do this, as it's pretty fundamental. Perhaps too much maintenance, though?

jeromekelleher Mar 2, 2023
Maintainer

No, seems like a useful operation. We actually do it quite a lot in stats calculations, it's just hidden away behind the scenes.

hyanwong Mar 2, 2023
Maintainer Author

Yep, it seems silly to reimplement it every time I need it (which as you say, is quite often). Any thoughts as to how to make it available in the API?

jeromekelleher Mar 2, 2023
Maintainer

Not really - would take time to think through properly.

I guess a TreeSequence.node_spans() method, returning the array computed above, would be the most obvious place to start. I'm not feeling particularly motivated to actually implement it, though.

hyanwong Mar 2, 2023
Maintainer Author

Sure - not high priority, but I have reimplemented this a few times and keep forgetting how, so will put it on the stack of things to do. I wonder if @benjeffery has suggestions for the API?
