This repository has been archived by the owner on Dec 12, 2024. It is now read-only.
Complete ARGs must keep *all* ancestors #61
jeromekelleher
started this conversation in
General
Replies: 1 comment
That all looks correct to me. My mental definition of "full ARG" during the msprime implementation was the smaller graph in which we forget about a site once its local MRCA has been reached. Clarifying that in the docs is a good idea, and I agree that we shouldn't change anything in the code. There are no implications for either likelihood calculation: edges above local roots don't affect genetic diversity in the sample, so sampling consistency plus the Markov property of the ARG mean that we are effectively averaging them out entirely.
Consider the following ARG:
You'd assume that this resolves as:
However, if we use an approach based on Hudson's algorithm we actually resolve it like this:
The ARG clearly shows that 5 is an ancestor of 4 over the interval 0-1, but Hudson's algorithm ignores everything that happens before the root of a local tree. That's because once a local section of genome has fully coalesced, it's snipped out of the simulation and no longer tracked. Thus, if we want to fully resolve an ARG and retain all information in it, we can't directly use Hudson's algorithm, but must instead use a modification of it that doesn't throw away fully coalesced segments.
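To make the "snipping out" concrete, here is a minimal sketch of a single coalescence event. It is not msprime's implementation: the `merge` function, the `keep_coalesced` flag, and the representation of a lineage as a dict from locus index to the number of samples it is ancestral to are all assumptions made for illustration.

```python
def merge(a, b, n, keep_coalesced=False):
    """Coalesce two lineages (hypothetical toy representation).

    Each lineage maps a locus index to the number of samples that the
    lineage is ancestral to at that locus. `n` is the sample size.
    """
    merged = {}
    for locus in sorted(set(a) | set(b)):
        count = a.get(locus, 0) + b.get(locus, 0)
        if count == n and not keep_coalesced:
            # Hudson-style: this locus has reached its local MRCA, so it
            # is dropped and nothing above the local root is tracked.
            continue
        merged[locus] = count
    return merged


n = 3
a = {0: 1, 1: 1}  # ancestral to one sample at loci 0 and 1
b = {0: 2}        # ancestral to two samples at locus 0 only

hudson = merge(a, b, n)                         # locus 0 is snipped out
complete = merge(a, b, n, keep_coalesced=True)  # locus 0 is retained
print(hudson, complete)
```

With `keep_coalesced=True` the merged lineage still carries locus 0, so any later event involving that lineage would also be recorded as ancestry above the local root of locus 0.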
Also, if we want to simulate an ARG we can't use Hudson's algorithm either - even if we retain all of the nodes before a full coalescence happens, the simulation will be incomplete from the point that the first full coalescence occurs.
If we want to simulate a complete ARG that contains all the nodes and all relationships between fully coalesced trees and those that have not yet fully coalesced, we must keep the full history for all segments of the genome, and so we must keep simulating all segments until all segments have fully coalesced. Note this is not necessarily back to the GMRCA, so it's not as bad as the Big ARG, but it's certainly much worse than Hudson's algorithm.
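The two stopping rules mentioned here can be sketched as follows. This is only an illustration under assumed names: `all_loci_coalesced` and `reached_gmrca` are hypothetical helpers, and a lineage is again assumed to be a dict from locus index to the number of samples it is ancestral to, with coalesced loci retained rather than dropped.

```python
def all_loci_coalesced(lineages, num_loci, n):
    """True once every locus has some lineage ancestral to all n samples."""
    return all(
        any(lineage.get(locus, 0) == n for lineage in lineages)
        for locus in range(num_loci)
    )


def reached_gmrca(lineages):
    """Stopping rule for the Big ARG: a single remaining lineage."""
    return len(lineages) == 1


n, num_loci = 3, 2

# Locus 0 has coalesced, locus 1 has not: keep simulating.
in_progress = [{0: 3, 1: 1}, {1: 2}]

# Both loci have coalesced, but on different lineages: the complete ARG
# described above can stop here, although the GMRCA has not been reached.
finished = [{0: 3}, {1: 3}]

print(all_loci_coalesced(in_progress, num_loci, n), reached_gmrca(in_progress))
print(all_loci_coalesced(finished, num_loci, n), reached_gmrca(finished))
```

The second case is the one that separates this from the Big ARG: every local tree has a root, but two lineages remain.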
This is another reason that a complete ARG is inherently less efficient to simulate than the output of Hudson's algorithm: not only do we have to store all the nodes at which nothing happens within the local trees, we have to store them above the roots too. I'd imagine many of these nodes have no effect on any other local trees either, so they really are pointless to simulate.
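As a rough sketch of what "above the roots" means in terms of stored edges, the code below scans a small hand-built edge list and reports the edges whose child is already ancestral to every sample at a given locus. The edge list, the node numbering, and the helper names are all hypothetical (they do not reproduce the example ARG in this post), and edges are assumed to be `(left, right, parent, child)` tuples over integer loci.

```python
def samples_below(node, locus, edges, samples):
    """Return the set of samples that `node` is ancestral to at `locus`."""
    if node in samples:
        return {node}
    found = set()
    for left, right, parent, child in edges:
        if parent == node and left <= locus < right:
            found |= samples_below(child, locus, edges, samples)
    return found


def edges_above_local_roots(edges, samples):
    """List (locus, parent, child) wherever child already subtends all samples."""
    above = []
    for left, right, parent, child in edges:
        for locus in range(left, right):
            if samples_below(child, locus, edges, samples) == samples:
                above.append((locus, parent, child))
    return above


samples = {0, 1, 2}
# Hand-built toy edges over two loci, [0, 1) and [1, 2).
edges = [
    (0, 1, 3, 0), (0, 1, 3, 1),  # node 3 joins samples 0 and 1 at locus 0
    (0, 1, 4, 3), (0, 2, 4, 2),  # node 4 is the local root of locus 0
    (1, 2, 5, 0), (0, 2, 5, 4),  # node 5 sits above node 4 at locus 0
    (1, 2, 6, 1), (0, 2, 6, 5),  # node 6 is the local root of locus 1
]

above = edges_above_local_roots(edges, samples)
print(above)
```

In this toy example the reported entries are exactly the portions of edges that a Hudson-style simulation would never record, because locus 0 was removed from tracking once node 4 was reached.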
It is potentially useful to know that a particular ancestor that is an internal node in one local tree is also an ancestor of the root of another local tree, so I think the `resolve` algorithm must be faithful to this information.

We should document that the `full_arg` option in msprime doesn't actually keep the full ARG, because it doesn't record anything above the local roots (it's too much trouble to change that).

Does this have any bearing on the likelihood calculations @JereKoskela?
Any thoughts?