Add selection test p-values #3205
Replies: 13 comments
-
Hi @a-ignatieva, Maybe I've missed something, but it looks like your function counts the number of nodes later than a node rather than the lineages? |
Beta Was this translation helpful? Give feedback.
-
Ah great, thanks @benjeffery - I figured this probably wasn't a new thought! I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right? I thought it'd best to go by nodes rather than times (the Relate test at least needs the number of lineages just above a node) - and something like this should work with polytomies I think: def get_num_lineages_in_tree(tree):
results = {}
count = 1
for node in tree.nodes(order="timedesc"):
if tree.is_sample(node):
results[node] = tree.num_samples()
else:
results[node] = count
count += tree.num_children(node) - 1
return results |
Beta Was this translation helpful? Give feedback.
-
This is great, thanks @a-ignatieva. To get an idea of how to structure this so we can compute it as efficiently as we can, would we usually want to do this for all mutations? So, we'd have a TreeSequence method like def selection_test_pvalue(self):
""""
Return an array of P-values under the Relate selection test for each mutation in the tree sequence.
""" or would we usually hand-pick specific mutations? |
Beta Was this translation helpful? Give feedback.
-
Clever! Almost all tree sequences from (for example) |
Beta Was this translation helpful? Give feedback.
-
@jeromekelleher I think definitely both, ideally! I'm looking to do the former, but I can also imagine wanting to do the latter for specific mutations.
@benjeffery I think this would be really great to have! |
Beta Was this translation helpful? Give feedback.
-
I don't think we should assume binary trees here for the statistic. I guess it would not add much to the efficiency to check on number of children of a node in the tree, though. |
Beta Was this translation helpful? Give feedback.
-
Agreed - this would make it basically useless for tsinfer trees. |
Beta Was this translation helpful? Give feedback.
-
Picking this one up again, here's a proposal for what a single mutation implementation would look like (not that we have the def relate_test_pvalue(self, mutation_id):
mut = self.mutation(mut)
site = self.site(mut.site)
tree = self.at(site.position)
time = self.node(mut.mutations).time
N = self.num_samples
k = 1 + tree.num_lineages(time)
fn = tree.num_samples(mut.node)
p = 0.0
for f in range(fn, N - k + 3):
p += (f - 1) * math.comb(N - f - 1, k - 3) / math.comb(N - 1, k - 1)
return p Is this correct when we have polytomies? It would be good to get this in to the library, and I'm happy to push through the coding if we can agree on an concrete definition to implement. |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
Thanks @a-ignatieva , this is super helpful (especially the diagram!)
Good call, I completely agree |
Beta Was this translation helpful? Give feedback.
-
So, I think we can just state that we assume the branching structure represents the true history, so in your example above k=9. If people are interested in a strictly binary model, then they need to use a different inference method, or somehow reconcile the non-binary inference with the trees themselves (e.g., by randomly generating binary trees using split_polytomies, or whatever). Note that in your example above just generating a random binary tree in the righthand subtree is of little use, because you also need branch lengths, and say, splitting equally would be quite unlikely under most population models. I'm pretty comfortable with having polytomies in cases like this - they really do exist, sometimes (e.g., see the sc2 trees!) |
Beta Was this translation helpful? Give feedback.
-
I'd just worry that a user might not appreciate these subtleties when running this test on their trees - but maybe just there should be a clear description/warning in the documentation. I think you have to be careful with the test in the actual multiple merger case. In the binary case we consider the tree just below the first split (at the dashed line in the pics above), and we have Also maybe better to replace the math.combs with log sums to avoid large N issues? Like p = 0.0
for f in range(fn, N - k + 3):
if f == 2:
p += 1.0 / ((N-1) * (N-2))
else:
p1 = np.sum(np.log(range(N-k-f+3, N-k+1)))
p2 = np.sum(np.log(range(N-f, N)))
p += (f-1) * np.exp(p1 - p2)
p = p * (k-1) * (k-2) |
Beta Was this translation helpful? Give feedback.
-
Hmmm - so the fundamental question then I guess is whether this is well defined for non binary trees at all, and whether we should just raise an error if the tree is non binary. Put another way - if we assume that we have something like tsinfer trees where the non binary nodes are from a mixture of uncertainty and genuine multiple mergers (in the large sample size case), would the result be a rough approximation of the truth, or just misleading? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
It would be nice to have a function to calculate the selection test described in the Relate paper (p.16 of supplement) for mutations in a given tree. Or perhaps this is something that could be added to the documentation as an example, it's easy to code up using existing functions. This is the code I have (and an example).
Beta Was this translation helpful? Give feedback.
All reactions